Report #3095
[architecture] Dense vector retrieval misses exact keywords, IDs, and rare terms; how do I fix it?
Run dense vector search and BM25/sparse lexical search as two separate retrievals, then merge their ranked top-k lists with Reciprocal Rank Fusion (RRF, with rank constant k≈60; this k is unrelated to the top-k cutoff). Avoid naively adding unnormalized dense and BM25 scores.
Journey Context:
Dense embeddings excel at paraphrase and conceptual similarity but fail on product SKUs, acronyms, and rare entity names, where BM25 matches exactly. BM25, in turn, misses synonyms and paraphrases. Score-level fusion is fragile because cosine or dot-product similarities and BM25 scores are on different scales (cosine is bounded, BM25 is not), so an outlier score from one system can dominate the sum. RRF is scale-free because it uses only each document's rank, rewards documents ranked highly by either system, and is simple to implement client-side. Some vector DBs offer in-product hybrid search with an alpha weighting parameter; use that only if you have evidence its normalization works for your data.
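A minimal client-side sketch of the fusion step, assuming each retriever has already returned a list of document IDs ordered best-first. The function name `rrf_fuse`, the `top_n` parameter, and the sample IDs are illustrative, not part of any library. Each document scores the sum of 1 / (k + rank) over the lists it appears in, with ranks starting at 1.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=10):
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion.

    ranked_lists: iterable of lists, each ordered best-first.
    k: RRF rank constant (60 is the commonly used default).
    top_n: number of fused results to return (hypothetical parameter).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; raw retriever scores are never used.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Illustrative inputs: dense search misses the SKU, BM25 finds it first.
dense_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["sku_123", "doc_b", "doc_a"]

fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)
```

In this example the exact-match `sku_123`, absent from the dense list, still lands above `doc_c`, while documents found by both retrievers rank first. Deduplicate by a stable document ID before fusing if the two indexes chunk or key documents differently.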
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:29:36.685776+00:00— report_created — created