Report #76240
[counterintuitive] Embedding similarity is not the same as semantic relevance
Implement hybrid search (combining dense vectors with sparse/keyword retrieval such as BM25) and use cross-encoder reranking for final ordering.
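One way to combine the two retrievers is reciprocal rank fusion (RRF), sketched below. This is a minimal illustration, not a definitive implementation: the function names `reciprocal_rank_fusion` and `hybrid_search`, the constant `k=60`, and the `rerank_fn` callable (a stand-in for a cross-encoder that scores a document against the query) are all assumptions introduced here. The dense and sparse rankings are assumed to arrive as lists of document IDs, best first.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(
    dense_ranking: Sequence[str],
    sparse_ranking: Sequence[str],
    rerank_fn: Optional[Callable[[str], float]] = None,
    top_n: int = 10,
) -> List[str]:
    """Fuse dense and sparse results, then optionally rerank the top candidates.

    rerank_fn is a hypothetical scorer (for example, a cross-encoder applied to
    the query and the document text); higher scores mean more relevant.
    """
    candidates = reciprocal_rank_fusion([dense_ranking, sparse_ranking])[:top_n]
    if rerank_fn is None:
        return candidates
    # sorted() is stable, so candidates with equal scores keep their fused order.
    return sorted(candidates, key=rerank_fn, reverse=True)
```

RRF only looks at ranks, so the dense and sparse scores never need to be put on a common scale. The reranker runs only on the fused top candidates, which keeps the more expensive scoring step small.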
Journey Context:
Developers often assume that cosine similarity on dense embeddings fully captures 'meaning'. Embeddings are lossy compressions optimized for broad semantic neighborhoods, not precise fact retrieval. They struggle with negation, specific alphanumeric IDs, and exact terminology, where a keyword match performs better. A search for 'HIV' might return documents about 'hives' because the two terms can sit close together in embedding space, while missing the exact medical document. Dense retrieval alone trades precision for semantic breadth.
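The keyword side of the argument can be illustrated with a small pure-Python BM25 scorer. This is a sketch under stated assumptions: the `tokenize` and `bm25_scores` helpers, the sample documents, and the parameter defaults `k1=1.5` and `b=0.75` are chosen here for illustration, and the IDF uses the common non-negative smoothed variant. A production system would more likely rely on a search engine's built-in BM25.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / max(n_docs, 1)

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # only exact token matches contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


# Hypothetical two-document corpus for illustration.
docs = {
    "hiv_doc": "HIV antiretroviral treatment guidelines",
    "hives_doc": "Hives are an itchy skin rash",
}
scores = bm25_scores("HIV", docs)
```

Because the tokenizer does no stemming, the token `hiv` never matches `hives`, so only the first document receives a non-zero score. That exact-match behavior is what the sparse half of a hybrid system contributes.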
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:33:47.875630+00:00— report_created — created