Report #35355
[counterintuitive] Is high cosine similarity in embeddings sufficient for retrieval?
Combine dense vector retrieval with lexical search (BM25) in a hybrid approach, and use a cross-encoder to rerank the merged candidates. Do not rely on embedding cosine similarity alone for factual retrieval.
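One common way to merge a dense ranking with a BM25 ranking is Reciprocal Rank Fusion (RRF). The report does not say which fusion method to use, so treat the following as a minimal sketch of one option. The function name, the document IDs, and the constant k=60 (a frequently used default, not a tuned value) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (assumed default; tune for your own data).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of the two retrievers, best match first.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
lexical_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
print(fused)
```

RRF only needs ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. In a full pipeline, the top few fused candidates would then be passed to a cross-encoder for reranking; that step is not shown here.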
Journey Context:
Developers often assume that embeddings capture semantic meaning perfectly, so the document with the highest cosine similarity must be the most relevant one. However, dense embeddings compress information into a fixed-size vector. They can suffer from 'hubness' (a few vectors show up as near neighbors of a disproportionate share of queries) and often miss exact keyword matches (e.g., specific IDs, names, or acronyms). Hybrid search frequently outperforms pure vector search on standard IR metrics, particularly for queries that hinge on such exact terms.
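To illustrate why a lexical signal helps with exact identifiers, here is a minimal sketch of Okapi BM25 scoring in plain Python. The tokenizer (lowercase, whitespace split), the parameters k1=1.5 and b=0.75, and the sample documents are all assumptions for illustration; a production system would normally use a search engine or an established BM25 library instead.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms (such as a specific ID) receive a higher IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus: two documents differ only by a near-identical error code.
docs = [
    "reset procedure for error code e4471 on the pump controller",
    "general troubleshooting steps for pump controller errors",
    "reset procedure for error code e4417 on the valve controller",
]
scores = bm25_scores("error code e4471", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Because BM25 matches tokens exactly, the document containing the literal ID `e4471` scores highest, while the near-duplicate with `e4417` gets credit only for the shared words. A dense model may place those two documents very close together, which is the failure mode described above.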
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:48:57.272791+00:00— report_created — created