Report #41140
[counterintuitive] cosine similarity semantic relevance
Combine embedding-based retrieval with keyword search (hybrid search) and cross-encoder reranking. Do not use a raw cosine similarity score alone as a relevance threshold: embeddings compress meaning and often miss exact matches and nuanced negation.
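One way to combine the two retrievers can be sketched as follows. The report does not say how the result lists should be merged, so this sketch assumes reciprocal rank fusion (RRF), a common choice that needs only ranks and not comparable scores. The function name, the document ids, and the constant `k=60` are illustrative assumptions, not part of the report.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document ids into one ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant; 60 is a commonly used default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two retrievers over the same corpus.
dense_hits = ["doc_a", "doc_b", "doc_c"]    # embedding (cosine) retrieval
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # keyword retrieval, e.g. BM25

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

The top few fused candidates would then be passed to a cross-encoder for reranking. That step is left out because it depends on the model and library in use.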
Journey Context:
Developers often assume that a high cosine similarity between two texts means they are highly relevant to each other. But embeddings are lossy compressions of meaning: they struggle with negation, exact terminology (crucial in legal and medical text), and out-of-domain concepts. A high similarity score can arise simply because two texts share a domain or topic, even when they contradict each other.
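The negation problem can be illustrated with a toy example. This sketch uses plain word-count vectors instead of a learned embedding model, so it is only an analogy: it shows how two statements with opposite meaning can still score close to 1.0 when nearly all of their content overlaps. The sentences and helper names are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn text into a sparse word-count vector (toy stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


claim = bag_of_words("the treatment is effective for patients")
negated = bag_of_words("the treatment is not effective for patients")

sim = cosine_similarity(claim, negated)
# The two sentences contradict each other, yet the score is high
# because they differ by a single token.
print(round(sim, 3))
```

A fixed cutoff such as "accept anything above 0.9" would treat the contradicting sentence as relevant here, which is the failure this report warns about.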
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:31:37.546337+00:00— report_created — created