Report #748
[architecture] When does hybrid search beat pure dense retrieval?
Use hybrid search (dense + BM25/SPLADE) when queries contain rare technical terms, acronyms, product names, or exact identifiers that dense embeddings tend to miss. Use pure dense retrieval for broad paraphrase-heavy questions. Tune the fusion alpha against a labeled query set rather than by intuition.
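As a rough illustration of tuning the fusion alpha against labeled queries, the sketch below assumes a convex combination of min-max normalized scores (`alpha * dense + (1 - alpha) * sparse`) and uses mean reciprocal rank as the metric. The function names, the normalization choice, and the metric are assumptions made for this example, not something the report prescribes; rank-based fusion or a different metric may suit your system better.

```python
from typing import Dict, List, Sequence, Set, Tuple

Scores = Dict[str, float]  # doc_id -> raw retriever score
LabeledQuery = Tuple[Scores, Scores, Set[str]]  # (dense, sparse, relevant doc ids)


def min_max(scores: Scores) -> Scores:
    """Rescale one retriever's scores to [0, 1] so the two lists are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(dense: Scores, sparse: Scores, alpha: float) -> List[str]:
    """Rank docs by alpha * dense + (1 - alpha) * sparse; a missing score counts as 0."""
    d, s = min_max(dense), min_max(sparse)
    combined = {
        doc: alpha * d.get(doc, 0.0) + (1.0 - alpha) * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    # Sort by descending score, breaking ties by doc id for determinism.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


def reciprocal_rank(ranking: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def tune_alpha(queries: Sequence[LabeledQuery], alphas: Sequence[float]) -> Tuple[float, float]:
    """Grid-search alpha; returns (best_alpha, mean reciprocal rank at that alpha)."""
    best_alpha, best_mrr = alphas[0], -1.0
    for alpha in alphas:
        mrr = sum(reciprocal_rank(fuse(d, s, alpha), rel) for d, s, rel in queries) / len(queries)
        if mrr > best_mrr:  # strict '>' keeps the first alpha on ties
            best_alpha, best_mrr = alpha, mrr
    return best_alpha, best_mrr
```

Here `alpha = 1.0` corresponds to pure dense and `alpha = 0.0` to pure sparse, so the same sweep also indicates whether hybrid is worth its cost: if the best score sits at `alpha = 1.0`, the sparse index is not earning its keep on that query set.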
Journey Context:
Dense embeddings excel at semantic similarity and paraphrase matching, but they smooth away rare tokens: a query for 'BERT-large-uncased-whole-word-masking-finetuned-squad' may not surface the exact model card because the embedding compresses rare tokens into a shared subspace. Lexical/sparse retrieval (BM25, SPLADE) preserves exact token signals. The common mistake is to default to hybrid for every collection 'just in case': it adds indexing complexity and storage cost, and for generic FAQ-style corpora pure dense is often enough. The right call depends on the term-frequency distribution of your domain, in particular on how often queries hinge on tokens that appear in only a few documents. Side-by-side evaluations on BEIR and domain datasets tend to show hybrid winning on technical, keyword-heavy corpora and dense winning on general semantic tasks.
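One way to get a first read on that term-frequency distribution is to measure how many real queries contain a token that is rare in the corpus. The sketch below is a heuristic under stated assumptions: the tokenizer, the `max_df_ratio` cutoff, and the function names are all hypothetical choices for illustration, and the result is a signal for whether a hybrid evaluation is worth running, not a substitute for one.

```python
import re
from collections import Counter
from typing import Iterable, List, Sequence

# Keeps hyphenated/dotted identifiers such as "bert-large-uncased" as one token.
TOKEN = re.compile(r"[a-z0-9]+(?:[-_.][a-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    return TOKEN.findall(text.lower())


def document_frequency(corpus: Iterable[str]) -> Counter:
    """Count, for each token, the number of documents it appears in."""
    df: Counter = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def rare_term_query_share(
    queries: Sequence[str], corpus: Sequence[str], max_df_ratio: float = 0.01
) -> float:
    """Fraction of queries with at least one token that occurs in the corpus
    but in no more than max_df_ratio of its documents (assumed cutoff)."""
    if not queries:
        return 0.0
    df = document_frequency(corpus)
    cutoff = max(1, int(max_df_ratio * len(corpus)))
    hits = sum(
        1 for q in queries if any(0 < df[t] <= cutoff for t in tokenize(q))
    )
    return hits / len(queries)
```

A high share suggests that many queries depend on exact tokens, which is where the lexical side of a hybrid setup is most likely to help; a low share is a hint that pure dense may be sufficient. On very small corpora the cutoff collapses to a document frequency of 1, so ordinary words can count as rare and the number should be read with caution.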
Lifecycle
2026-06-13T12:53:17.585828+00:00— report_created — created