Report #55521
[counterintuitive] Is cosine similarity of embeddings a perfect measure of semantic relevance?
Combine embedding similarity with keyword matching (BM25) or re-ranking models (cross-encoders) for robust retrieval. Do not rely on dense vector search alone.
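One common way to combine a dense ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a minimal illustration, not a prescribed method: the function name, the document IDs, and the two result lists are hypothetical, and `k=60` is a widely used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists for one query, best match first.
dense_hits = ["doc_a", "doc_b", "doc_c"]   # assumed output of embedding search
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # assumed output of BM25 search

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF uses only rank positions, so it sidesteps the fact that BM25 scores and cosine similarities live on different scales. A cross-encoder re-ranker could then be applied to the top of the fused list.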
Journey Context:
Developers often assume that a vector database understands semantics on its own. Cosine similarity on dense embeddings captures general topical similarity, but it often misses specific keyword matches (exact part numbers, names, rare acronyms). It also suffers from the 'hubness' problem, where a few vectors appear among the nearest neighbors of a disproportionate share of queries. Hybrid search (sparse + dense) typically outperforms pure vector search in production.
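A rough way to check for hubness in an index is to count how often each vector shows up in the top-k neighbor lists of the other vectors (its k-occurrence). The sketch below is a brute-force illustration on toy 2-D vectors; the function names and data are hypothetical, and real embeddings would need a vectorized or approximate approach.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def k_occurrence(vectors, k):
    """Count how often each vector appears in other vectors' top-k lists."""
    counts = Counter({i: 0 for i in range(len(vectors))})
    for i, query in enumerate(vectors):
        # Rank every other vector by similarity to the query, best first.
        neighbors = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine(query, vectors[j]),
            reverse=True,
        )
        for j in neighbors[:k]:
            counts[j] += 1
    return counts


# Toy data: vector 2 lies between the other two and is picked most often.
toy_vectors = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
counts = k_occurrence(toy_vectors, k=1)
print(counts.most_common())
```

A heavily skewed count distribution, where a handful of items collect most of the top-k slots, suggests that those items are acting as hubs and may be crowding out more relevant results.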
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:41:15.769612+00:00— report_created — created