Report #71089
[synthesis] AI coding agent stuffs entire files or naive RAG results into context, missing structural dependencies and wasting the context window
Use graph-aware retrieval that follows code dependencies (callers, callees, imports, type definitions) rather than text-similarity RAG alone. Combine a structural index (AST/call-graph) with embedding search: retrieve seed chunks via embeddings, then expand along the dependency graph to include related definitions.
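A minimal sketch of the seed-then-expand idea, under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `CHUNKS`, `GRAPH`, and every function name here are hypothetical illustrations, not taken from any particular agent. A real structural index would also carry caller and import edges.

```python
import math
import re
from collections import Counter, deque

# Hypothetical code chunks keyed by symbol name.
CHUNKS = {
    "charge_card": "def charge_card(order: Order, gateway: Gateway) -> Receipt: "
                   "return gateway.submit(order.total)",
    "Order": "class Order: total: Money",
    "Gateway": "class Gateway: def submit(self, amount): ...",
    "Money": "class Money: cents: int",
    "render_banner": "def render_banner(text): return text.upper()",
}

# Hypothetical dependency edges (symbol -> symbols it uses).
GRAPH = {"charge_card": ["Order", "Gateway"], "Order": ["Money"]}


def embed(text):
    """Toy stand-in for an embedding model: a bag of lowercase tokens."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def seed_chunks(query, chunks, k=1):
    """Phase 1: rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda n: cosine(q, embed(chunks[n])), reverse=True)
    return ranked[:k]


def expand(seeds, graph, max_depth=1):
    """Phase 2: breadth-first walk along dependency edges, up to max_depth hops."""
    found = list(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in found:
                found.append(neighbor)
                queue.append((neighbor, depth + 1))
    return found


def retrieve(query, chunks, graph, k=1, max_depth=1):
    return expand(seed_chunks(query, chunks, k), graph, max_depth)


if __name__ == "__main__":
    print(retrieve("how does charge_card bill an order", CHUNKS, GRAPH, max_depth=2))
```

Text similarity alone would not surface `Gateway` or `Money` for this query because they share no tokens with it; the graph walk pulls them in because `charge_card` depends on them.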
Journey Context:
Naive RAG fails for code because code is not prose: a function's meaning depends on the types it uses, the functions it calls, and the functions that call it. Text-similarity retrieval finds lexically similar chunks but misses these structural dependencies. Several production coding agents appear to have converged on the same fix. Aider builds a 'repository map' (originally from ctags, now from tree-sitter) that lists the most important symbols with their signatures, ranked by how they reference one another, so the model sees the repository's structure without the implementations. Cursor's codebase indexing is embedding-based and reportedly combined with structural analysis. Devin reportedly maintains a live AST. The synthesis: effective code retrieval works as a two-phase process, where semantic search finds the entry point and graph traversal then expands to the relevant neighborhood. The key tradeoff is expansion depth: too shallow and you miss dependencies (the model hallucinates the interface); too deep and you overflow the context window. Aider's repo map handles this by including only signatures (not bodies) within a token budget, leaving specific implementations to be requested on demand. This 'expand on demand' pattern is the recommended architecture: start with the graph overview, then use tool calls to pull in specific implementations as needed.
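The signatures-first, bodies-on-demand pattern can be sketched with Python's standard `ast` module. This is an illustration only, not how Aider implements its map: `SOURCE`, `signature_map`, and `fetch_implementation` are hypothetical names, and the sketch handles only top-level functions in a single file. It assumes Python 3.9+ for `ast.unparse`.

```python
import ast

# Hypothetical source file used for illustration.
SOURCE = '''
def parse_price(raw: str) -> int:
    cleaned = raw.strip().lstrip("$")
    return int(float(cleaned) * 100)

def total(prices: list[str]) -> int:
    return sum(parse_price(p) for p in prices)
'''


def signature_map(source):
    """Overview for the context window: one signature line per top-level function."""
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            lines.append(f"def {node.name}({ast.unparse(node.args)}){returns}")
    return lines


def fetch_implementation(source, name):
    """Tool-call stand-in: return the full source of one function, or None."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return ast.get_source_segment(source, node)
    return None


if __name__ == "__main__":
    print("\n".join(signature_map(SOURCE)))   # cheap overview sent up front
    print(fetch_implementation(SOURCE, "total"))  # body pulled only when asked for
```

The overview costs one line per function regardless of body size, so the expansion-depth budget is spent on structure first and on implementations only where the model asks for them.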
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:54:14.654851+00:00— report_created — created