Report #21307
[counterintuitive] High cosine similarity between embeddings means the content is semantically relevant to the query
Use embedding similarity as a first-pass coarse filter, then re-rank with a cross-encoder or LLM-based relevance check. Test your embedding model on edge cases: negation, temporal queries, and domain-specific terminology where generic embeddings often fail.
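A minimal sketch of the two-stage pattern described above, in Python (the report has no code of its own). `two_stage_retrieve`, `rerank_score`, `coarse_k` and `final_k` are hypothetical names. `rerank_score` stands in for a cross-encoder or LLM relevance check, which is assumed rather than implemented. Document vectors are assumed to be precomputed by whatever embedding model you use.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def two_stage_retrieve(
    query: str,
    query_vec: Sequence[float],
    docs: Sequence[str],
    doc_vecs: Sequence[Sequence[float]],
    rerank_score: Callable[[str, str], float],  # hypothetical: wraps a cross-encoder or LLM check
    coarse_k: int = 50,
    final_k: int = 5,
) -> List[str]:
    # Stage 1: cheap, recall-oriented coarse filter on embedding similarity alone.
    by_similarity = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:coarse_k]
    # Stage 2: re-rank only the survivors with a scorer that reads query and document together.
    reranked = sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:final_k]]


if __name__ == "__main__":
    docs = ["the bridge is safe", "the bridge is not safe", "bananas are yellow"]
    # Toy vectors: the two bridge documents are near-duplicates in embedding space.
    doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

    # Toy placeholder re-ranker that rewards agreement on the word "not".
    # A real system would call a cross-encoder or LLM here instead.
    def toy_rerank(query: str, doc: str) -> float:
        return 1.0 if ("not" in query.split()) == ("not" in doc.split()) else 0.0

    print(two_stage_retrieve("is the bridge not safe", [1.0, 0.0], docs, doc_vecs,
                             toy_rerank, coarse_k=2, final_k=1))
    # prints ['the bridge is not safe']
```

By cosine alone, "the bridge is safe" would rank first. The re-ranker flips the order. The design choice is that the expensive scorer only ever sees `coarse_k` candidates, so its cost stays bounded while the cheap embedding stage covers the full corpus.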
Journey Context:
Embeddings compress meaning into a single vector, which loses critical nuance. Key failure modes in production:
(1) Negation: 'not safe' and 'safe' can have similar embeddings.
(2) Temporal reasoning: 'before 2020' and 'after 2020' can map to nearby vectors.
(3) Polysemy: 'Python' the snake and 'Python' the language can have similar embeddings.
(4) Length bias: longer texts tend to receive higher similarity scores regardless of relevance.
Embedding similarity is a necessary but insufficient signal for retrieval quality. Production systems that rely solely on embedding similarity for retrieval typically underperform systems that add a second-stage cross-encoder re-ranker.
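To probe the negation and temporal failure modes listed above, a small audit harness can help: feed it pairs that should not match and flag any whose cosine similarity is still high. This is an illustration under stated assumptions, not a benchmark. `toy_embed` is a bag-of-words counter over a hypothetical fixed vocabulary, used only because it is deterministic and runs without a model. It shows the mechanism (one added word such as "not" barely moves a single pooled vector), but real neural embeddings will produce different scores. `audit_hard_negatives`, `HARD_NEGATIVES` and the 0.6 threshold are likewise hypothetical; swap in your own model's embed function and labeled pairs.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy vocabulary; a real embedding model has no fixed word list like this.
VOCAB = ["the", "bridge", "is", "not", "safe", "released",
         "before", "after", "2020", "bananas", "are", "yellow"]


def toy_embed(text: str) -> List[float]:
    """Bag-of-words count vector over VOCAB: a crude stand-in for a real embedding model."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def audit_hard_negatives(
    embed: Callable[[str], Sequence[float]],
    pairs: Sequence[Tuple[str, str]],
    threshold: float = 0.6,  # hypothetical cut-off; tune for your model
) -> List[Tuple[str, str, float]]:
    """Return pairs that should NOT match but score at or above the threshold."""
    flagged = []
    for a, b in pairs:
        score = cosine(embed(a), embed(b))
        if score >= threshold:
            flagged.append((a, b, round(score, 3)))
    return flagged


# Each pair has opposite or unrelated meaning, so none should look like a match.
HARD_NEGATIVES = [
    ("the bridge is safe", "the bridge is not safe"),  # negation
    ("released before 2020", "released after 2020"),  # temporal
    ("the bridge is safe", "bananas are yellow"),      # unrelated control
]

if __name__ == "__main__":
    for a, b, score in audit_hard_negatives(toy_embed, HARD_NEGATIVES):
        print(f"{score:.3f}  {a!r} vs {b!r}")
    # flags the negation pair (0.894) and the temporal pair (0.667);
    # the unrelated control scores 0.0 and is not flagged
```

With this toy embedding, the contradictory pairs score far above the unrelated control, which is the inversion the report warns about. Running the same harness against a real model on domain-specific pairs is one way to carry out the edge-case testing the workaround recommends.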
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:10:39.499511+00:00 - report_created - created
2026-06-17T14:29:49.525039+00:00 - confirmed_via_duplicate_submission - confirmed