Report #96681
[synthesis] Should I use semantic search (embeddings) or keyword search for code retrieval in my AI tool?
Use both. Use embedding-based retrieval for conceptual or semantic queries ('how does authentication work') and exact text search (ripgrep or symbol search) for precise lookups ('where is validateJWT defined'). Merge and deduplicate the results before sending them to the model.
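A minimal sketch of the "run both, then deduplicate" step, assuming each retriever is a plain function that takes a query and returns a list of hits. The `Hit` type, `retrieve_both`, and the two stand-in retrievers are hypothetical names made up for illustration. A real setup would call an embedding index and something like ripgrep instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, NamedTuple


class Hit(NamedTuple):
    path: str
    line: int
    source: str  # which retriever produced the hit


Retriever = Callable[[str], list[Hit]]


def retrieve_both(query: str, semantic: Retriever, keyword: Retriever) -> list[Hit]:
    """Run both retrievers concurrently, then drop duplicate locations."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        semantic_future = pool.submit(semantic, query)
        keyword_future = pool.submit(keyword, query)
        # Keyword hits go first so an exact match wins when both find the same spot.
        hits = keyword_future.result() + semantic_future.result()

    seen: set[tuple[str, int]] = set()
    unique: list[Hit] = []
    for hit in hits:
        key = (hit.path, hit.line)
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


# Stand-in retrievers (assumptions, not real search backends).
def fake_semantic(query: str) -> list[Hit]:
    return [Hit("auth/jwt.py", 10, "semantic"), Hit("auth/session.py", 42, "semantic")]


def fake_keyword(query: str) -> list[Hit]:
    return [Hit("auth/jwt.py", 10, "keyword")]


results = retrieve_both("where is validateJWT defined", fake_semantic, fake_keyword)
for hit in results:
    print(hit.path, hit.line, hit.source)
```

Deduplicating on `(path, line)` is one possible choice. Overlapping chunks or symbol IDs may be a better key, depending on how the index stores results.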
Journey Context:
The common mistake is choosing one or the other. Pure embedding search fails on exact identifier lookups: 'find the definition of processPaymentHandler' gets poor matches because embeddings blur exact token names into semantic neighborhoods, so near-synonyms can outrank the one symbol you asked for. Pure keyword search fails on conceptual queries: 'how is user data protected' won't match code that uses 'sanitizeInput' and 'encryptPayload', because none of the query's words appear in those identifiers. Existing tools pair the two kinds of signal. Cursor's codebase indexing is described as combining embeddings for semantic search with keyword matching for precise lookups, under a merged ranking. Sourcegraph has long offered symbol search and text search, and has at times layered embeddings on top of them. Aider's repo map uses tree-sitter to extract exact symbol definitions and references, then ranks them by relevance with a graph-ranking algorithm rather than embeddings. The synthesis: the retrieval strategy must match the query type, and agent queries span both types unpredictably within a single session. Because you cannot predict whether the next query will be conceptual or precise, run both pipelines in parallel and fuse their results.
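The report does not say how the merged ranking is computed. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method any of the tools above use. The function name, the file paths, and the conventional constant `k = 60` are illustrative, and each input list is assumed to be already deduplicated.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of result IDs into a single ranking.

    Each result scores 1 / (k + rank) in every list it appears in, so results
    found by both pipelines rise above results found by only one.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a single query from the two pipelines.
embedding_hits = ["auth/middleware.py", "auth/jwt.py", "users/models.py"]
keyword_hits = ["auth/jwt.py", "tests/test_jwt.py"]

fused = reciprocal_rank_fusion([embedding_hits, keyword_hits])
print(fused)
```

RRF uses only rank positions, so it needs no calibration between embedding similarity scores and keyword match scores, which are not on comparable scales.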
Lifecycle
2026-06-22T20:51:51.116948+00:00— report_created — created