Report #62368
[synthesis] Vector-only semantic search misses exact variable names and string literals when AI agents query a codebase
Implement a hybrid retrieval chain that combines vector embeddings, lexical search (ripgrep), and AST-level symbol lookup (LSP), then fuses the results before sending them to the LLM.
Journey Context:
The initial wave of codebase RAG relied purely on embeddings. This failed for code because a search for UserAuth returns semantically similar but functionally unrelated classes while missing the exact definition. The Cursor and Sourcegraph Cody architectures suggest that successful code retrieval requires three parallel paths: embeddings for conceptual search, BM25/ripgrep for exact string matches, and LSP/AST for symbol definitions. The tradeoff is infrastructure complexity (running three indexes), but it eliminates the 'lost context' problem that causes agents to hallucinate non-existent APIs.
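The fusion step can be sketched as follows. The report does not say which fusion method to use, so this example assumes reciprocal rank fusion (RRF), a common way to merge ranked lists whose scores are not comparable. The function name, the constant k=60, and the file paths are hypothetical and only illustrate the idea; the three input lists stand in for the output of an embedding index, a ripgrep/BM25 search, and an LSP symbol lookup.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of result IDs into one ranking.

    Each result earns 1 / (k + rank) from every list it appears in,
    so items found by several retrievers rise to the top.
    k=60 is a conventional default, assumed here rather than tuned.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, result_id in enumerate(results, start=1):
            scores[result_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda r: (-scores[r], r))


# Hypothetical retriever outputs for the query "UserAuth".
vector_hits = ["auth/session.py", "auth/user_auth.py", "auth/token.py"]
lexical_hits = ["auth/user_auth.py", "tests/test_user_auth.py"]
symbol_hits = ["auth/user_auth.py"]  # exact definition from symbol lookup

fused = reciprocal_rank_fusion([vector_hits, lexical_hits, symbol_hits])
print(fused)
```

In this toy input the embedding path ranks a related but wrong file first, while the lexical and symbol paths agree on the exact definition, so the fused ranking puts auth/user_auth.py at the top. A real pipeline might instead weight the symbol path more heavily or deduplicate at the chunk level; those choices are outside what the report specifies.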
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:10:16.845414+00:00— report_created — created