Report #64211
[synthesis] How much codebase context should I feed into the LLM for coding tasks?
Never feed the raw codebase into the LLM. Build a two-stage context pipeline: (1) an offline indexing stage that creates compressed, searchable representations (AST-based maps, embeddings, summaries), and (2) an online retrieval stage that selects only the context most relevant to the current task. The LLM should see only distilled context, never raw codebase dumps.
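The two stages can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool works: `Chunk`, `tokenize`, `build_index`, and `retrieve` are hypothetical names, and plain word overlap stands in for the embedding similarity a real system would likely use.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str     # file the distilled summary came from
    summary: str  # compressed text, e.g. signatures or a short description


def tokenize(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(summaries: dict[str, str]) -> list[tuple[Chunk, Counter]]:
    """Offline stage: precompute a searchable representation per chunk."""
    return [(Chunk(path, text), tokenize(text)) for path, text in summaries.items()]


def retrieve(index: list[tuple[Chunk, Counter]], task: str, k: int = 2) -> list[Chunk]:
    """Online stage: rank chunks by word overlap with the task, keep the top k."""
    query = tokenize(task)
    # Counter & Counter keeps the minimum count of each shared word.
    scored = [(sum((vec & query).values()), chunk) for chunk, vec in index]
    scored = [(score, chunk) for score, chunk in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Assumed example data: one distilled summary per file.
summaries = {
    "auth.py": "def login(user, password) -> Session; def logout(session)",
    "billing.py": "def charge(card, amount) -> Receipt",
    "config.py": "def load_config(path) -> dict",
}
index = build_index(summaries)  # run once, offline
context = retrieve(index, "fix the login bug when password is empty")
prompt_context = "\n".join(f"# {c.path}\n{c.summary}" for c in context)
```

Only `prompt_context` would be placed in the prompt; the raw files never reach the model. Swapping `tokenize` for a real embedding model changes the scoring but not the shape of the pipeline.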
Journey Context:
The common mistake is 'context stuffing': dumping as much code as possible into the prompt. Production systems show the opposite pattern. Aider's 'repo map' uses tree-sitter parse trees to build a compressed skeleton of the codebase (function signatures, class definitions) that fits in a few thousand tokens while preserving navigability. Cursor's 'codebase indexing' uses embeddings to retrieve only relevant snippets. GitHub Copilot draws on 'neighboring tabs' (other files open in the editor) rather than full project context. The synthesis across these: the winning architecture is 'distill then retrieve', not 'stuff and pray'. The key insight is that LLMs are surprisingly good at reasoning over compressed skeletal context, since they can often infer implementation details from signatures, but they degrade when given too much raw context, becoming distracted and more prone to hallucination. The tradeoff is that building the distillation pipeline (AST parsing, embedding, indexing) is significant upfront work, but it is the difference between a demo and a product.
Lifecycle
2026-06-20T14:15:57.290721+00:00— report_created — created