Report #49851
[counterintuitive] Is high cosine similarity in embeddings a reliable measure of semantic relevance?
Combine dense vector similarity with sparse keyword retrieval (hybrid search) and cross-encoder reranking to capture nuanced relevance, rather than relying purely on embedding cosine similarity.
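One way to sketch the hybrid pipeline is shown below. It is a minimal illustration, not a reference implementation: reciprocal rank fusion (RRF) is just one common way to merge dense and sparse rankings, and the report does not say which fusion method to use. The `dense_search`, `sparse_search`, and `rerank` callables are hypothetical placeholders for whatever vector index, keyword index (for example BM25), and cross-encoder you actually run.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranked list.

    Each document earns 1 / (k + rank) from every list it appears in.
    k=60 is a commonly used default, assumed here rather than tuned.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(query, dense_search, sparse_search, rerank,
                  top_k=5, candidates=50):
    """Dense + sparse retrieval, rank fusion, then reranking.

    All three callables are hypothetical stand-ins:
      dense_search(query, n)  -> list of doc ids from a vector index
      sparse_search(query, n) -> list of doc ids from a keyword index
      rerank(query, doc_ids)  -> doc ids reordered by a cross-encoder
    """
    dense_hits = dense_search(query, candidates)
    sparse_hits = sparse_search(query, candidates)
    fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
    # Only the fused shortlist goes to the (more expensive) reranker.
    return rerank(query, fused[:candidates])[:top_k]
```

Documents that appear in both the dense and the sparse list accumulate score from each, so an exact keyword match can lift a result that the embedding alone ranked lower. The reranker then scores each query and document pair jointly, which is where finer distinctions are expected to be resolved.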
Journey Context:
Developers often assume that vector search equals semantic search. Cosine similarity on dense embeddings tends to capture broad topical similarity but can miss fine-grained relevance and exact entity or attribute matching (e.g., returning a document about 'Apple revenue' for the query 'Apple stock price' because the two vectors are close). It also struggles with negation and with specific instructions in the query. Used alone, dense retrieval is a blunt instrument.
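The 'Apple revenue' versus 'Apple stock price' case can be illustrated with a toy example. The four-dimensional vectors below are invented for illustration and do not come from any real embedding model; they only mimic two documents that share most of their topical signal with the query. The `term_coverage` helper is a hypothetical, deliberately crude stand-in for a sparse keyword score.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def term_coverage(query, text):
    """Fraction of query terms that literally occur in the text."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms)


# Toy, hand-made vectors (assumption, not model output). Rough meaning of
# the dimensions: [Apple the company, finance, stock price, revenue].
QUERY_TEXT = "Apple stock price"
QUERY_VEC = [0.9, 0.8, 0.3, 0.0]
DOCS = {
    "Apple revenue grew last quarter": [0.9, 0.8, 0.0, 0.3],
    "Apple stock price fell today": [0.8, 0.9, 0.4, 0.0],
}

for text, vec in DOCS.items():
    sim = cosine(QUERY_VEC, vec)
    cov = term_coverage(QUERY_TEXT, text)
    print(f"cosine={sim:.3f}  term_coverage={cov:.2f}  {text}")
```

With these toy numbers both documents score above 0.9 on cosine similarity, so a similarity threshold alone would accept the revenue document. The keyword signal separates them clearly, which is the gap that hybrid search and reranking are meant to close.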
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:09:31.927777+00:00— report_created — created