Report #69050
[synthesis] AI coding agent's codebase search is too slow at query time because it embeds and searches the entire repo on each request
Build a pre-computed indexing layer that updates on file save rather than at query time, re-embedding only the files that changed instead of re-indexing the entire codebase. Store embeddings with file-level and symbol-level metadata. The retrieval path should be: index lookup → rerank → select.
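A minimal sketch of the incremental update step, assuming a content hash per file is enough to detect changes. The names `IncrementalIndex`, `Entry`, `on_file_saved`, and `toy_embed` are hypothetical, the embedding function is a dependency-free stand-in for a real model, and the regex-based symbol extraction is a placeholder for a proper parser.

```python
import hashlib
import re
from dataclasses import dataclass, field


def toy_embed(text: str, dim: int = 64) -> list[float]:
    # Hypothetical stand-in for a real embedding model: hashes identifiers
    # into a fixed-size bag-of-words vector so the sketch has no dependencies.
    vec = [0.0] * dim
    for token in re.findall(r"[A-Za-z_]\w*", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


@dataclass
class Entry:
    path: str               # file-level metadata
    symbol: str             # symbol-level metadata (function or class name)
    embedding: list[float]


@dataclass
class IncrementalIndex:
    file_hashes: dict[str, str] = field(default_factory=dict)
    entries: dict[str, list[Entry]] = field(default_factory=dict)
    embed_calls: int = 0    # counts embedding work, to show what is skipped

    def on_file_saved(self, path: str, source: str) -> bool:
        """Re-embed a single file if its content changed. Returns True if updated."""
        digest = hashlib.sha256(source.encode()).hexdigest()
        if self.file_hashes.get(path) == digest:
            return False    # unchanged content: no embedding work at all
        # Crude symbol extraction; an assumption standing in for AST parsing.
        symbols = re.findall(r"^[ \t]*(?:def|class)[ \t]+(\w+)", source, flags=re.M)
        new_entries = []
        for name in symbols or ["<module>"]:
            self.embed_calls += 1
            new_entries.append(Entry(path, name, toy_embed(name + " " + source)))
        self.entries[path] = new_entries    # replace only this file's entries
        self.file_hashes[path] = digest
        return True

    def on_file_deleted(self, path: str) -> None:
        self.entries.pop(path, None)
        self.file_hashes.pop(path, None)
```

A file watcher would call `on_file_saved(path, source)` on each save event. Because entries are keyed by path, an edit to one file replaces only that file's entries and leaves the rest of the index untouched.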
Journey Context:
The naive approach (embed the query, then search the entire codebase at query time) fails at scale: embedding search over a large repo can take seconds, and the results lack structural awareness. Cursor's codebase indexing runs in the background, updates incrementally, and stores symbol-level metadata alongside embeddings. Sourcegraph Cody pre-indexes repositories with code intelligence (AST-level symbol data). Windsurf indexes on save. The synthesis: the indexing layer is a separate architectural component with its own storage, incremental update strategy, and query interface. It is NOT just 'we use embeddings'; it is a full pipeline: file watcher → incremental embed → merge into index → metadata enrichment (symbol names, types, imports) → reranking at query time. The reranking step (using a cross-encoder or LLM) is critical because embedding similarity alone can return code that looks similar on the surface but is irrelevant to the query.
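The query-time half of that pipeline (index lookup → rerank → select) can be sketched as follows. This is an illustration under stated assumptions, not any vendor's implementation: `embed` is a toy bag-of-words stand-in for an embedding model, `rerank_score` is a toy term-coverage stand-in for a cross-encoder or LLM, and the three indexed chunks are invented sample data.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a sparse bag-of-words vector.
    return Counter(tokens(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented sample chunks, each carrying file-level and symbol-level metadata.
CHUNKS = [
    {"path": "auth/session.py", "symbol": "refresh_token",
     "text": "def refresh_token(session): renew the expired auth token for the user session"},
    {"path": "auth/tests.py", "symbol": "test_token",
     "text": "def test_token(): token token token"},
    {"path": "ui/button.py", "symbol": "render",
     "text": "def render(button): draw the button on screen"},
]

# Embeddings are computed once, ahead of time, not on each query.
INDEX = [(chunk, embed(chunk["text"])) for chunk in CHUNKS]


def lookup(query: str, k: int) -> list[dict]:
    """Stage 1: cheap similarity search over pre-computed embeddings."""
    q = embed(query)
    scored = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]


def rerank_score(query: str, chunk: dict) -> float:
    # Hypothetical stand-in for a cross-encoder or LLM judgment: the fraction
    # of distinct query terms that appear in the chunk text or symbol name.
    q_terms = set(tokens(query))
    c_terms = set(tokens(chunk["text"] + " " + chunk["symbol"]))
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def retrieve(query: str, k: int = 3, n: int = 1) -> list[dict]:
    """Full query path: index lookup, then rerank, then select the top n."""
    candidates = lookup(query, k)
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:n]
```

With this sample data, the query "refresh expired token" ranks `test_token` first in the embedding lookup because that chunk repeats the word "token", while the rerank stage promotes `refresh_token`, which covers every query term. The only per-query embedding work is the query itself; everything else is read from the pre-built index.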
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:22:52.982835+00:00— report_created — created