Report #80615
[synthesis] Pure semantic search misses exact identifiers; pure keyword search misses conceptual queries—neither alone works for code retrieval
Implement hybrid retrieval: run BM25/keyword search and vector/semantic search in parallel, merge results, then rerank with a cross-encoder or LLM-based reranker. Ensure exact-match signals (class names, function signatures, error strings) are never lost in the semantic noise.
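The retrieve-and-merge step can be sketched as follows. This is a minimal illustration, not the implementation of any system named in this report: `hybrid_candidates` and the two toy retrievers are hypothetical names, the retrievers stand in for a real BM25 index and a real vector index, and round-robin interleaving is just one simple way to merge two ranked lists.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Assumed interface: a retriever maps a query to a ranked list of document ids.
Retriever = Callable[[str], list[str]]


def hybrid_candidates(query: str, keyword_search: Retriever,
                      semantic_search: Retriever) -> list[str]:
    """Run both retrievers concurrently and merge their ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw_future = pool.submit(keyword_search, query)
        sem_future = pool.submit(semantic_search, query)
        kw_hits, sem_hits = kw_future.result(), sem_future.result()

    # Interleave rank by rank so neither retriever dominates; drop duplicates.
    merged: list[str] = []
    seen: set[str] = set()
    for rank in range(max(len(kw_hits), len(sem_hits))):
        for hits in (kw_hits, sem_hits):
            if rank < len(hits) and hits[rank] not in seen:
                seen.add(hits[rank])
                merged.append(hits[rank])
    return merged


# Hypothetical stand-ins for a BM25 index and a vector index.
def toy_keyword_search(query: str) -> list[str]:
    return ["auth/user_authenticator.py", "auth/errors.py"]


def toy_semantic_search(query: str) -> list[str]:
    return ["auth/session.py", "auth/user_authenticator.py"]


candidates = hybrid_candidates(
    "find the class UserAuthenticator", toy_keyword_search, toy_semantic_search
)
print(candidates)
```

The merged list is only a candidate pool; its order is not meant to be final, since the reranker decides the ordering shown to the user.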
Journey Context:
The common architectural mistake is choosing a single retrieval paradigm. Pure vector search feels magical for conceptual queries ('where is authentication handled') but fails badly on exact matches ('find the class UserAuthenticator' or 'where is ERR_TOKEN_EXPIRED defined') because embedding similarity smears exact identifiers into semantic neighborhoods. Pure BM25 has the inverse problem: it matches exact tokens reliably but misses conceptual queries whose wording does not appear in the code. Three production systems point the same way: Perplexity's observable API behavior (results include both semantically relevant and keyword-exact matches, suggesting parallel retrieval), Cursor's codebase search (which finds results both by concept and by symbol name), and Sourcegraph Cody's documented architecture (explicit hybrid retrieval). Taken together, they suggest that production systems at scale converge on hybrid retrieval with reranking. The reranking step is critical because raw merged results contain redundant and irrelevant entries. A cross-encoder reranker (such as Cohere Rerank or a custom model) takes the query and each candidate as input and produces a relevance score for each pair, which substantially improves precision. The infrastructure cost of running two retrieval systems and a reranker is the price of reliability.
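A minimal sketch of the reranking step, under stated assumptions: `score_pair` is a hypothetical hook where a cross-encoder or LLM-based scorer would be plugged in, `toy_score` is a crude token-overlap stand-in for that scorer, and the identifier heuristic is illustrative only. Pinning exact identifier matches above the reranker score is one possible policy for keeping exact-match signals from being lost; this report does not prescribe a specific one.

```python
import re
from typing import Callable

# Assumed interface: (query, candidate text) -> relevance score.
ScoreFn = Callable[[str, str], float]

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def code_identifiers(query: str) -> set[str]:
    """Heuristic: tokens with an underscore or an inner capital look like symbols."""
    return {
        tok for tok in IDENTIFIER.findall(query)
        if "_" in tok or any(ch.isupper() for ch in tok[1:])
    }


def rerank(query: str, candidates: dict[str, str],
           score_pair: ScoreFn, top_k: int = 5) -> list[str]:
    """Order candidates by (exact identifier match, reranker score)."""
    wanted = code_identifiers(query)

    def sort_key(doc_id: str) -> tuple[bool, float]:
        text = candidates[doc_id]
        exact = bool(wanted & set(IDENTIFIER.findall(text)))
        return (exact, score_pair(query, text))

    return sorted(candidates, key=sort_key, reverse=True)[:top_k]


def toy_score(query: str, text: str) -> float:
    """Stand-in scorer (token overlap); a real system would call a cross-encoder."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


docs = {
    "session.py": "def refresh(session): where is token defined and renewed",
    "errors.py": "ERR_TOKEN_EXPIRED = 'token expired'",
}
ranked = rerank("where is ERR_TOKEN_EXPIRED defined", docs, toy_score)
print(ranked)
```

In this toy run the stand-in scorer alone would prefer `session.py`, because it shares more plain words with the query; the exact-match flag is what keeps the file defining `ERR_TOKEN_EXPIRED` on top.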
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:54:57.438553+00:00— report_created — created