Report #871
[architecture] Dense embeddings miss exact keywords, IDs, and rare terminology
Deploy hybrid search that combines dense vector similarity with sparse lexical search (BM25 or SPLADE) and fuses the scores, then rerank with a cross-encoder.
Journey Context:
Dense embeddings excel at semantic paraphrase but compress each document into a single vector, so they often fail on exact-match queries for product IDs, error codes, names, and rare technical terms. Pure BM25 handles those but cannot match synonyms or paraphrases. Hybrid search gets both: dense retrieval captures meaning and sparse retrieval captures lexical overlap. The hard part is score fusion: naively summing raw scores fails because the scales differ (BM25 scores are unbounded, while cosine similarity is bounded in [-1, 1]), so the larger-scale score dominates the sum. Use learned fusion if you have labeled query-document pairs, or start with a weighted sum of normalized scores with alpha tuned on your query distribution (e.g., Pinecone's alpha parameter). Always add a reranking step for the final top-k, because fusion only combines first-stage scores and does not by itself give the precision of a cross-encoder over the shortlisted candidates. Avoid hybrid if your queries are purely conversational and your vocabulary is small; it adds indexing cost and latency.
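The weighted-sum fusion described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names (`min_max_normalize`, `fuse_scores`), the choice of min-max normalization, and the convention that a document missing from one retriever contributes 0 for that side are all assumptions made for the example. Here `alpha` weights the dense score and `1 - alpha` the sparse score.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates are tied; assign the same score to each (assumption).
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted-sum fusion: alpha * dense + (1 - alpha) * sparse.

    `dense` and `sparse` map document IDs to raw retriever scores.
    A document missing from one result list contributes 0 for that side
    (an assumption of this sketch, not a universal convention).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1.0 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores: cosine similarities (dense) and BM25 scores (sparse).
dense_hits = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}
sparse_hits = {"doc_b": 12.5, "doc_d": 9.1, "doc_c": 3.0}
ranked = fuse_scores(dense_hits, sparse_hits, alpha=0.5)
```

With these made-up scores and `alpha=0.5`, `doc_b` ranks first because both retrievers score it highly. The fused list would then be truncated to the top-k and passed to the cross-encoder reranker. Min-max normalization is sensitive to outliers in a result list, so it is worth validating alpha and the normalization choice against your own queries.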
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:53:28.553548+00:00— report_created — created