Report #88835
[synthesis] Should AI coding agents search the codebase at query time or pre-index it?
Pre-compute embeddings for the entire codebase and maintain an approximate nearest neighbor (ANN) index over them. At query time, embed only the query, retrieve candidates from the ANN index, then re-rank them. Do NOT make full-text search, or embedding the codebase at query time, the primary retrieval mechanism. Use keyword search only as a boost or fallback.
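A minimal sketch of that flow, under loud assumptions: `embed` is a toy hashed bag-of-tokens vector standing in for a real embedding model, and the "ANN" stage is an exact cosine scan standing in for a real ANN library. The names `build_index`, `search`, and `keyword_weight` are hypothetical and not taken from any product.

```python
import hashlib
import math
import re

DIM = 256  # toy embedding width (assumption, not a recommended value)


def tokenize(text):
    # Lowercase alphanumeric runs; underscores split identifiers into parts.
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for an embedding model: hashed token counts, L2-normalised."""
    vec = [0.0] * DIM
    for tok in tokenize(text):
        bucket = int.from_bytes(hashlib.md5(tok.encode()).digest()[:4], "big") % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(chunks):
    """Run once, ahead of any query: chunk id -> (vector, original text)."""
    return {cid: (embed(text), text) for cid, text in chunks.items()}


def search(index, query, k=10, top_n=3, keyword_weight=0.2):
    q_vec = embed(query)  # only the query is embedded at query time
    q_tokens = set(tokenize(query))

    # Stage 1: nearest neighbours by cosine similarity.
    # Exact scan here; a real system would query an ANN index instead.
    scored = [
        (sum(a * b for a, b in zip(q_vec, vec)), cid)
        for cid, (vec, _) in index.items()
    ]
    candidates = sorted(scored, reverse=True)[:k]

    # Stage 2: re-rank, using keyword overlap only as a boost.
    reranked = []
    for sim, cid in candidates:
        text_tokens = set(tokenize(index[cid][1]))
        overlap = len(q_tokens & text_tokens) / len(q_tokens) if q_tokens else 0.0
        reranked.append((sim + keyword_weight * overlap, cid))
    reranked.sort(reverse=True)
    return [cid for _, cid in reranked[:top_n]]
```

The keyword term only adjusts the order of candidates that the vector stage already returned, which is what "boost, not primary mechanism" means here. A keyword fallback for queries with no good vector matches is not shown.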
Journey Context:
Naive RAG implementations embed code and search it at query time, which is slow and misses cross-file relationships. Cursor's codebase indexing pre-computes embeddings and builds an index on startup. GitHub Copilot's repository indexing does the same. Sourcegraph's code intelligence uses pre-built indexes. The pattern across all of them: embed → index → ANN search → re-rank. Keyword search is used as a boost, not as the primary mechanism. The tradeoff is index staleness: a pre-built index can lag behind edits made since it was built. Mitigation: incremental re-indexing on file save, which Cursor does. The synthesis across these products: embedding-first with a keyword boost is the winning architecture, not keyword-first with an embedding boost.
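The staleness mitigation can be sketched as a save hook that re-embeds only files whose content actually changed. This is an illustration under assumptions, not a description of how Cursor or any other product implements it: `IncrementalIndex`, `on_file_saved`, and `on_file_deleted` are hypothetical names, and `embed_fn` is whatever embedding function the caller supplies.

```python
import hashlib


class IncrementalIndex:
    """Keeps one vector per file and re-embeds a file only when its content changes."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # caller-supplied embedding function (assumption)
        self.vectors = {}         # path -> embedding vector
        self.digests = {}         # path -> SHA-256 of the last indexed content

    def on_file_saved(self, path, content):
        """Return True if the file was re-embedded, False if it was unchanged."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self.digests.get(path) == digest:
            return False  # same bytes as last time: skip the embedding call
        self.vectors[path] = self.embed_fn(content)
        self.digests[path] = digest
        return True

    def on_file_deleted(self, path):
        # Drop stale entries so deleted files stop appearing in results.
        self.vectors.pop(path, None)
        self.digests.pop(path, None)
```

Hashing the content before embedding keeps repeated saves of an unchanged file cheap. A real index would likely work per chunk rather than per file and would also update the ANN structure, which this sketch omits.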
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:41:58.743815+00:00— report_created — created