Report #41276
[counterintuitive] high embedding cosine similarity means semantic relevance
Augment dense vector similarity with sparse retrieval (BM25) or contextual embedding generation to capture exact matches, ordering, and negation.
Journey Context:
Developers often assume that a high cosine similarity between two texts in embedding space means the texts are semantically relevant to each other. However, standard dense embeddings compress a passage's meaning into a single vector, which loses much of its compositional logic, word order, and negation. A document saying 'The project was NOT successful' can score a high cosine similarity against the query 'Was the project successful?' because nearly every other token is identical, even though the document asserts the opposite outcome. Dense retrieval alone is therefore unreliable for exact term matching and negation. Hybrid search (BM25 + dense), or prepending a context-specific summary to each chunk before embedding, is needed to bridge this semantic gap.
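The hybrid approach described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the BM25 scorer is a small pure-Python version of the standard formula, the `dense_scores` values are made-up placeholders standing in for cosine similarities from whatever embedding model is in use, and the helper names (`bm25_scores`, `min_max`, `hybrid_rank`) and the weight `alpha=0.5` are assumptions chosen for the example. Min-max weighted fusion is only one way to combine the two signals; reciprocal rank fusion is a common alternative.

```python
import math
from collections import Counter

PUNCT = ".,?!'\""

def tokenize(text):
    # Lowercase, split on whitespace, strip basic punctuation.
    return [t.strip(PUNCT).lower() for t in text.split() if t.strip(PUNCT)]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Score every document against the query with the standard BM25 formula.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def min_max(scores):
    # Rescale scores to [0, 1] so sparse and dense signals are comparable.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(dense, sparse, alpha=0.5):
    # Blend normalized dense and sparse scores; return doc indices, best first.
    d, s = min_max(dense), min_max(sparse)
    fused = [alpha * x + (1 - alpha) * y for x, y in zip(d, s)]
    return sorted(range(len(fused)), key=lambda i: (-fused[i], i))

docs = [
    "The project was not successful.",
    "The project was successful and shipped on time.",
    "Quarterly budget review notes.",
]
query = "project was not successful"

# Placeholder values only: assumed cosine similarities from some embedding
# model that barely separates the negated and non-negated documents.
dense_scores = [0.91, 0.93, 0.20]

sparse_scores = bm25_scores(query, docs)
dense_only = sorted(range(len(docs)), key=lambda i: (-dense_scores[i], i))
fused_ranking = hybrid_rank(dense_scores, sparse_scores)
```

With these placeholder numbers, the dense-only ranking puts the "successful and shipped" document first, while the fused ranking promotes the document that literally contains "not". Note that BM25 does not understand negation; it only rewards the exact token when the query carries it, so a query phrased without "not" would gain nothing from the sparse signal on this example.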
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:45:17.687204+00:00— report_created — created