Report #86012
[synthesis] Why does pure semantic/embedding search fail for code retrieval in AI coding tools?
Always combine keyword/exact-match search (BM25 or ripgrep) with semantic/embedding search, then merge and rerank the results. Keyword search handles exact identifiers, error messages, and API names; semantic search handles conceptual queries. Neither alone suffices for code.
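The report does not say how the two result lists should be merged. One common option is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. The sketch below is a minimal illustration under that assumption; the file paths and the `k=60` constant are hypothetical placeholders, and a cross-encoder or LLM reranker would still run on the merged list afterwards.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of the two retrieval paths (paths are made up).
keyword_hits = ["src/users/repo.py", "src/users/api.py", "tests/test_users.py"]
semantic_hits = ["src/auth/session.py", "src/users/repo.py", "src/users/models.py"]

merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

A file found by both paths (`src/users/repo.py` here) rises to the top, while files found by only one path are kept as lower-ranked candidates for the reranker.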
Journey Context:
The common mistake is building code search on embeddings alone. Cursor's codebase indexing reveals a hybrid approach: @-symbol references resolve via exact match, not embedding similarity, while conceptual queries use semantic search. Perplexity similarly combines traditional search APIs with semantic understanding. Pure semantic search fails for code because an identifier like 'getUserById' carries little distinguishing semantic signal, yet the exact string is what you need to locate. An embedding maps 'getUserById' to a general 'user retrieval' vector, so the query returns dozens of loosely related user functions instead of the one definition. Conversely, keyword search fails for 'how does authentication work' because no single file contains that phrase. The reranking step (cross-encoder or LLM-based) resolves conflicts between the two retrieval paths. A non-obvious detail: the weight between keyword and semantic retrieval should shift with query type. Short, camelCase, or snake_case queries should lean keyword-heavy; natural language questions should lean semantic-heavy.
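The query-type weighting described above can be sketched as a small heuristic. Everything here is an assumption for illustration: the function names, the regular expressions, and the specific weights (0.8/0.2, 0.7/0.3, 0.3/0.7) are not taken from Cursor, Perplexity, or any other tool, and `blend` assumes both scores are already normalized to the range 0 to 1.

```python
import re

# Identifier-like tokens: camelCase / PascalCase with inner capitals, or snake_case.
_CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$|^(?:[A-Z][a-z0-9]+){2,}$")
_SNAKE = re.compile(r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+$")


def looks_like_identifier(token):
    """True if the token resembles a code identifier rather than a plain word."""
    return bool(_CAMEL.match(token) or _SNAKE.match(token))


def retrieval_weights(query):
    """Return (keyword_weight, semantic_weight); the pair sums to 1.0.

    The thresholds and weights are illustrative guesses, not tuned values.
    """
    tokens = query.split()
    if not tokens:
        return 0.5, 0.5
    if any(looks_like_identifier(t) for t in tokens):
        return 0.8, 0.2  # exact identifier present: lean on keyword search
    if len(tokens) <= 2:
        return 0.7, 0.3  # short query: probably a name or error fragment
    return 0.3, 0.7      # natural-language question: lean on embeddings


def blend(keyword_score, semantic_score, query):
    """Weighted sum of two scores, each assumed normalized to [0, 1]."""
    kw, sem = retrieval_weights(query)
    return kw * keyword_score + sem * semantic_score
```

With these rules, `retrieval_weights("getUserById")` leans keyword-heavy and `retrieval_weights("how does authentication work")` leans semantic-heavy. In practice the weights would be tuned against real queries rather than hard-coded.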
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:57:29.281962+00:00— report_created — created