Report #74902
[synthesis] How do production AI coding tools handle large codebase context without hitting token limits?
Implement a three-tier context architecture: (1) static system instructions and tool definitions, (2) code snippets retrieved dynamically via embedding search with AST-aware chunking, (3) a rolling conversation window. Never stuff the entire codebase or even entire files into context. Pre-compute and incrementally update an embedding index; retrieve only the top-K relevant chunks at query time.
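A minimal sketch of how the three tiers might be assembled under a token budget. Every name here (Chunk, top_k, estimate_tokens, build_context) is hypothetical, the 4-characters-per-token estimate is a crude assumption, and the embeddings are assumed to come from some external model; it illustrates the idea and is not how any particular tool implements it.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str               # file the chunk came from
    text: str               # chunk source text
    embedding: list[float]  # assumed to be pre-computed by an embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_embedding: list[float], index: list[Chunk], k: int) -> list[Chunk]:
    """Return the k chunks most similar to the query, best first."""
    ranked = sorted(index, key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return ranked[:k]


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def build_context(system_prompt, query_embedding, index, history, budget, k=3):
    """Assemble prompt parts: static tier, retrieved tier, rolling window."""
    parts = [system_prompt]                       # tier 1: static instructions
    used = estimate_tokens(system_prompt)

    for chunk in top_k(query_embedding, index, k):  # tier 2: retrieved snippets
        cost = estimate_tokens(chunk.text)
        if used + cost > budget:
            break
        parts.append(f"# {chunk.path}\n{chunk.text}")
        used += cost

    recent = []                                   # tier 3: newest turns that fit
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        recent.append(turn)
        used += cost
    parts.extend(reversed(recent))                # restore chronological order
    return parts
```

In this sketch the retrieved tier is filled before the conversation window, so a small budget can crowd out older turns; a real system would likely reserve a share of the budget for each tier.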
Journey Context:
The naive approach of including as much code as possible in context fails because: (a) it hits token limits immediately on real repos, (b) cost and latency grow with every added prompt token, and (c) the 'lost in the middle' effect means the model tends to overlook context buried in the middle of long prompts. Cursor's behavior points to the solution: it creates .cursorindex files (observable in .gitignore patterns), pre-computes embeddings of the codebase, and retrieves only relevant snippets when you query. Its codebase indexing runs incrementally on file save. Sourcegraph Cody uses a similar approach but augments it with precise code intelligence (for example, go-to-definition results). The critical nuance is the chunking strategy: fixed-size chunks split functions midway, destroying semantic coherence. AST-aware chunking (splitting on function/class boundaries using Tree-sitter) produces self-contained chunks that are more useful when retrieved in isolation. The tradeoff: embedding search adds roughly 50-200 ms of retrieval latency per query and requires maintaining an index, but this is far cheaper than including 100K tokens of irrelevant code in every LLM call.
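To make the chunking difference concrete, here is a small sketch that splits Python source on top-level function and class boundaries. It uses Python's built-in ast module as a stand-in for Tree-sitter (an assumption made so the example needs no third-party packages; Tree-sitter would be the choice for multi-language repos). The function names chunk_python_source and fixed_size_chunks are hypothetical.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so the chunk is complete.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available in Python 3.8+
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


def fixed_size_chunks(source: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring syntax."""
    return [source[i:i + size] for i in range(0, len(source), size)]


if __name__ == "__main__":
    sample = (
        "import os\n\n"
        "def load(path):\n"
        "    with open(path) as f:\n"
        "        return f.read()\n\n"
        "class Index:\n"
        "    def add(self, item):\n"
        "        self.items.append(item)\n"
    )
    for chunk in chunk_python_source(sample):
        print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
    # The fixed-size baseline cuts wherever the character count lands.
    print(fixed_size_chunks(sample, 40)[1])
```

This sketch skips module-level statements such as imports and keeps each class as a single chunk; a fuller chunker would probably attach imports to every chunk and split large classes by method.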
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:19:10.156886+00:00— report_created — created