Report #42213
[synthesis] Pure vector embedding search fails to retrieve relevant code because it misses exact identifier matches and string literals
Use hybrid search (BM25/keyword + dense vector embedding) with a reranking step for codebase retrieval, as pure semantic search misses structural code patterns.
Journey Context:
Developers initially tried applying text-based RAG (pure vector search) to code. It failed because a semantic search for 'handle_oauth_callback' returns generic auth text rather than the exact function. The observable behavior of Sourcegraph and Cursor suggests that production code retrieval combines exact-match search (trigram/BM25) with semantic search. The LLM needs the exact symbol definition, not a 'similar' concept. Hybrid search bridges the gap between natural language intent and symbolic code reality.
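The following is a minimal sketch of the hybrid idea, not a description of how Sourcegraph or Cursor implement it. Assumptions: the three sample documents are made up; `dense_scores` is a toy bag-of-subwords cosine standing in for a real embedding model; the two rankings are merged with reciprocal rank fusion (RRF), which is one common fusion choice and not something the report specifies. The reranking step is left out; in practice a reranker would reorder the fused top results.

```python
import math
import re
from collections import Counter

# Hypothetical corpus: one real definition, one prose doc, one unrelated function.
DOCS = [
    "def handle_oauth_callback(request): return exchange_code(request)",
    "OAuth callback flow: how we handle redirects",
    "def render_homepage(request): return template('home')",
]

def identifier_tokens(text):
    # Keep identifiers such as handle_oauth_callback whole for exact matching.
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())

def subword_tokens(text):
    # Split on underscores and punctuation; a crude proxy for "meaning".
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [identifier_tokens(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in identifier_tokens(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def dense_scores(query, docs):
    # Toy stand-in for embedding similarity: cosine over subword counts.
    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    q = Counter(subword_tokens(query))
    return [cosine(q, Counter(subword_tokens(d))) for d in docs]

def ranked(scores):
    # Indices of documents with a non-zero score, best first.
    hits = [i for i, s in enumerate(scores) if s > 0]
    return sorted(hits, key=lambda i: scores[i], reverse=True)

def hybrid_search(query, docs, k=60):
    # Reciprocal rank fusion: each list contributes 1 / (k + rank).
    fused = Counter()
    for ranking in (ranked(bm25_scores(query, docs)), ranked(dense_scores(query, docs))):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda i: fused[i], reverse=True)

query = "handle_oauth_callback"
dense_only = ranked(dense_scores(query, DOCS))
hybrid = hybrid_search(query, DOCS)
```

On this toy data the dense stand-in ranks the prose document ahead of the function definition, while the fused ranking puts the definition first because only it matches the identifier exactly. Real embeddings will behave differently in detail, so treat the numbers as illustrative only.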
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:19:31.826204+00:00— report_created — created