Report #48717
[counterintuitive] dense embedding similarity search is sufficient for retrieval
Implement hybrid search (combining dense embeddings with sparse/lexical retrieval such as BM25) for robust RAG pipelines, especially for code or exact-term matching.
Journey Context:
Developers often assume that dense vector embeddings capture all the semantics a query needs, making keyword search obsolete. Dense models map concepts to vectors, but they often fail at exact lexical matches (e.g., specific IDs, proper nouns, error codes, or exact variable names in code), because a rare token contributes little to the overall embedding. If a user searches for 'error code OS-1023', a dense retriever might return documents about general OS errors, while a sparse retriever (BM25) will match the rare token 'OS-1023' exactly and weight it heavily. Hybrid search merges the semantic matching of dense retrieval with the exact-match precision of sparse retrieval, which typically improves recall on queries that contain rare or exact terms.
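The 'OS-1023' case above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production retriever: the function names (tokenize, bm25_rank, rrf_fuse), the sample documents, and the hard-coded dense_ranking are all hypothetical. The dense ranking stands in for the output of an embedding retriever, and reciprocal rank fusion (RRF) with k=60 is just one common way to merge the two lists; the report itself does not prescribe a fusion method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep hyphenated identifiers such as "os-1023" as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    # sorted() is stable, so documents with equal scores keep index order.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per doc."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)


# Hypothetical corpus; only document 1 contains the exact code.
docs = [
    "General guide to troubleshooting operating system errors and crashes.",
    "Error code OS-1023: the kernel module failed to load; reinstall the driver.",
    "How to update your operating system to the latest release.",
    "Error code NET-404 means the network interface is unreachable.",
]
query = "error code OS-1023"

# Assumption: a dense retriever ranked the general OS-errors document first.
dense_ranking = [0, 1, 2, 3]
sparse_ranking = bm25_rank(query, docs)

hybrid_ranking = rrf_fuse([dense_ranking, sparse_ranking])
print("sparse:", sparse_ranking)
print("hybrid:", hybrid_ranking)
```

In this toy setup BM25 puts the document containing 'OS-1023' first, and the fused list does too, even though the assumed dense ranking preferred the general document. In a real pipeline the dense list would come from an embedding index, and the fusion weights or the value of k would need tuning on your own queries.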
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:15:14.116944+00:00— report_created — created