Report #26971
[counterintuitive] semantic search alone is sufficient for RAG retrieval
Implement hybrid retrieval combining BM25 (keyword/lexical) and semantic (embedding) search with reciprocal rank fusion. For queries involving exact matches (error codes, product IDs, proper nouns, specific technical terms), BM25 is essential. Use semantic search for conceptual and natural language queries. Together they cover each other's systematic blind spots.
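A minimal sketch of reciprocal rank fusion, assuming each retriever has already returned a ranked list of document IDs for the same query. The function name, the document IDs, and the two hit lists are hypothetical; k=60 is the commonly used default constant, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents missing from a list add nothing for it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a BM25 retriever and an embedding retriever.
bm25_hits = ["doc_err", "doc_net", "doc_faq"]
dense_hits = ["doc_net", "doc_timeout", "doc_err"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because the fusion uses only ranks, the BM25 and embedding scores never need to be put on a common scale, which is the main reason this method is easy to bolt onto an existing embedding-only pipeline.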
Journey Context:
Embedding-based search feels magical: it handles synonyms, conceptual similarity, and paraphrases. But it has systematic blind spots that cause silent retrieval failures: (1) exact string matches (error codes like ECONNREFUSED, product SKUs, proper nouns), where semantic similarity misses the point entirely; (2) rare or domain-specific terms that the embedding model saw little of in training, so the query lands near unrelated text; (3) negation and specific constraints, which embeddings blur into general similarity. BM25 covers the exact-term cases but misses semantic relationships. The information retrieval community established long ago that hybrid retrieval (BM25 plus dense retrieval, fused via reciprocal rank fusion) generally outperforms either method alone. This is not controversial in IR; it is standard practice. Yet many RAG implementations default to embedding-only search because it is simpler to implement. If you must choose one, BM25 often outperforms semantic search on technical content, while semantic search tends to win on narrative content.
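To illustrate the exact-match case, here is a small self-contained BM25 scorer, written as a sketch rather than a production implementation. The tokenizer, the three sample documents, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration; the IDF uses the non-negative log(1 + ...) variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs so "ECONNREFUSED" stays one token.
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # A term absent from the document contributes nothing.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus: only the first document contains the literal error code.
docs = [
    "Connection failed with ECONNREFUSED when the server is not listening.",
    "Network timeouts usually mean the remote host is slow to respond.",
    "How to configure retries for failed HTTP connections.",
]
scores = bm25_scores("ECONNREFUSED", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the document containing the literal token gets a non-zero score, so the match cannot be diluted by topically similar text about timeouts or retries. The same property is why this scorer alone would miss a paraphrase such as "server refused the connection".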
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:40:14.274812+00:00— report_created — created