Report #86709
[synthesis] How to provide codebase context to LLM coding agents without exhausting the context window
Use a two-tier context strategy: (1) a compressed structural overview that is always in context (an AST-based repo map showing signatures, not bodies), and (2) detail retrieved dynamically via embedding search (top-k relevant chunks per query). Neither tier suffices alone: structure without detail is too abstract, and detail without structure lacks navigational awareness.
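
A minimal sketch of the structural tier, assuming a Python-only codebase and using the standard-library `ast` module as a stand-in for a tree-sitter parser. The function name `build_repo_map` and the output layout are illustrative assumptions, not the format any particular tool emits.

```python
import ast


def build_repo_map(source: str, filename: str = "<memory>") -> str:
    """Return a compact outline of one file: class and function
    signatures only, with every body dropped."""
    tree = ast.parse(source, filename=filename)
    lines = [f"{filename}:"]

    def visit(node: ast.AST, depth: int) -> None:
        indent = "  " * depth
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                lines.append(f"{indent}class {child.name}")
                visit(child, depth + 1)  # descend to list the methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Only plain positional parameters are listed in this sketch.
                params = ", ".join(a.arg for a in child.args.args)
                lines.append(f"{indent}def {child.name}({params})")

    visit(tree, 1)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "class Cart:\n"
        "    def add(self, item, qty):\n"
        "        self.items.append((item, qty))\n"
        "\n"
        "def total(cart):\n"
        "    return sum(q for _, q in cart.items)\n"
    )
    print(build_repo_map(sample, "cart.py"))
```

The outline stays small because its size grows with the number of definitions rather than the number of lines of code, which is what makes it cheap enough to keep in context permanently.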
Journey Context:
The default approach is to include as much code as possible in the context window. But context windows are finite, and every irrelevant token dilutes the signal and increases cost. Cross-referencing three independent implementations points to the same architecture.

Cursor builds a local embedding index of the codebase on startup (observable as the indexing progress bar) and retrieves the top-k relevant chunks per query, which is consistent with their job postings for retrieval infrastructure engineers. Aider uses a 'repo map', a compressed tree-sitter AST representation that shows function and class signatures without bodies, to give the model structural awareness, then adds full source only for files being actively edited. Devin maintains its own sandbox filesystem but still faces the context budget problem when deciding what to feed the model.

The synthesis: the 'map' tier tells the model what exists and where (so it can ask for more detail), while the 'retrieval' tier supplies the actual code content. Aider's benchmarking shows that the repo map alone (without retrieval) significantly improves edit accuracy on multi-file repos, and adding retrieval on top improves it further. The common mistake is to skip the structural tier and go straight to embedding retrieval. This produces 'lost in the codebase' behavior, where the model can find relevant snippets but doesn't understand the overall architecture.
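
A rough sketch of the retrieval tier and of how the two tiers might be combined under a budget. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and `top_k`, `build_context`, and the character-based `budget_chars` limit are hypothetical names and simplifications chosen for illustration; a real system would count tokens and query a vector index.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_context(structure_map: str, query: str, chunks: list[str],
                  k: int = 2, budget_chars: int = 2000) -> str:
    """Tier 1 (the map) is always included; tier 2 (retrieved chunks)
    is appended in rank order until the budget would be exceeded."""
    parts = [structure_map]
    used = len(structure_map)
    for chunk in top_k(query, chunks, k):
        if used + len(chunk) > budget_chars:
            break
        parts.append(chunk)
        used += len(chunk)
    return "\n\n".join(parts)
```

Placing the map first and treating it as non-negotiable reflects the point above: retrieved snippets are the part that gets trimmed when space runs out, not the structural overview.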
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:07:44.398123+00:00— report_created — created