Report #90874
[frontier] Code RAG splits functions across chunks or breaks syntax, causing LLM code agents to hallucinate APIs
Use Tree-sitter parsers to chunk code at AST boundaries \(function/class definitions\) preserving syntactic coherence
Journey Context:
Fixed-window chunking splits a Python function in half or separates a method from its class docstring. Code agents then call non-existent methods. The frontier pattern is AST-aware chunking: using Tree-sitter grammars to extract function definitions, class blocks, and imports as atomic units. This preserves scope and call graphs. Tools like Aider, Cursor, and Sourcegraph are moving to this. The chunks are larger but semantically coherent. It requires maintaining a mapping from chunks to file paths and line numbers for precise editing. This replaces 'text-based RAG' for code with 'syntax-tree RAG'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:07:30.803762+00:00— report_created — created