Report #47767
[counterintuitive] Is high cosine similarity between embeddings a reliable measure of semantic relevance?
Combine embedding similarity with keyword/lexical search (hybrid search) and apply a cross-encoder re-ranker to the top-k results, rather than relying solely on embedding cosine similarity.
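One way to combine the two result lists is reciprocal rank fusion (RRF). The report does not name a fusion method, so treat this as a minimal sketch under assumptions: the document IDs are hypothetical, `k=60` is a commonly used default rather than a tuned value, and the cross-encoder re-ranking step that would follow on the fused top-k is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retriever outputs for the query "error E1234".
# The vector search ranks a topically similar page first;
# the lexical (e.g. BM25) search ranks the exact-ID match first.
vector_hits = ["doc_general_errors", "doc_E1234", "doc_troubleshooting"]
lexical_hits = ["doc_E1234", "doc_changelog"]

fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
print(fused)
```

Because RRF uses only ranks, it avoids having to normalize BM25 scores against cosine scores, which live on different scales. The document found by both retrievers (`doc_E1234`) rises to the top of the fused list.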
Journey Context:
Developers often assume that vector databases inherently understand semantics. Cosine similarity on dense embeddings captures general topical similarity, but it often misses precise keyword matches (such as specific IDs, names, or acronyms). It can also suffer from the anisotropy problem: in many models the embeddings occupy a narrow cone of the vector space, so even unrelated texts receive high scores and differences in similarity become noisy. Robust retrieval therefore usually needs hybrid search (BM25 + vectors) followed by a cross-encoder re-ranker.
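The anisotropy effect can be illustrated with a toy example. This is a sketch, not a measurement of any real model: the three-dimensional vectors, the shared offset, and the labels are all invented for illustration, and the mean-centering step is included only to expose the effect, not as a fix proposed by this report.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": a large shared offset (the narrow cone) plus a small
# component that actually distinguishes the texts. All values are invented.
offset = [10.0, 10.0, 10.0]
signals = {
    "cats": [1.0, 0.0, 0.0],
    "kittens": [0.9, 0.1, 0.0],
    "tax_law": [0.0, 0.0, 1.0],
}
vectors = {
    name: [o + s for o, s in zip(offset, sig)] for name, sig in signals.items()
}

# Raw cosine: related and unrelated pairs both score above 0.99.
raw_related = cosine(vectors["cats"], vectors["kittens"])
raw_unrelated = cosine(vectors["cats"], vectors["tax_law"])

# Subtracting the mean vector removes the shared offset and widens the gap.
dims = range(len(offset))
mean = [sum(v[i] for v in vectors.values()) / len(vectors) for i in dims]
centered = {
    name: [x - m for x, m in zip(v, mean)] for name, v in vectors.items()
}
centered_related = cosine(centered["cats"], centered["kittens"])
centered_unrelated = cosine(centered["cats"], centered["tax_law"])

print(f"raw:      related={raw_related:.4f} unrelated={raw_unrelated:.4f}")
print(f"centered: related={centered_related:.4f} unrelated={centered_unrelated:.4f}")
```

In this toy setup the raw scores for the related and unrelated pairs differ by less than 0.01, so a fixed similarity threshold could not separate them. After centering, the related pair stays strongly positive while the unrelated pair turns negative.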
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:39:46.656965+00:00— report_created — created