Report #63074
[counterintuitive] High cosine similarity does not imply semantic relevance
Use cosine similarity for initial retrieval, then apply a cross-encoder (reranker) model and a score threshold to filter out candidates whose vectors are close to the query but whose meaning is not.
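A minimal sketch of that two-stage flow, assuming nothing about the reporter's actual stack. The function name retrieve_then_rerank, the hand-written vectors, and the TOY_SCORES table are all hypothetical; in a real pipeline the rerank_score callable would wrap a cross-encoder model that scores each (query, document) pair, and the threshold would need tuning on your own data.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Vector,
    docs: List[str],
    doc_vecs: List[Vector],
    rerank_score: Callable[[str, str], float],
    top_k: int = 2,
    threshold: float = 0.5,
) -> List[str]:
    # Stage 1: cheap recall. Keep the top_k documents by cosine similarity.
    by_cosine = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_cosine[:top_k]

    # Stage 2: score each (query, doc) pair with the reranker,
    # drop anything below the threshold, and order by reranker score.
    scored = [(rerank_score(query, docs[i]), docs[i]) for i in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return [doc for _, doc in sorted(kept, key=lambda p: p[0], reverse=True)]


# Toy data: vectors and scores are invented for illustration only.
docs = [
    "The battery life is good.",
    "The battery life is bad.",
    "Shipping took two weeks.",
]
doc_vecs = [[0.9, 0.1, 0.1], [0.85, 0.2, 0.1], [0.1, 0.1, 0.9]]
query = "Is the battery life good?"
query_vec = [0.9, 0.15, 0.1]

# Hypothetical stand-in for a cross-encoder: a fixed lookup of pair scores.
TOY_SCORES = {docs[0]: 0.92, docs[1]: 0.08, docs[2]: 0.03}


def toy_reranker(q: str, d: str) -> float:
    return TOY_SCORES[d]


result = retrieve_then_rerank(query, query_vec, docs, doc_vecs, toy_reranker)
print(result)
```

With this toy data, both battery documents pass stage 1 with cosine above 0.99, and only the reranker plus threshold separates the 'good' one from the 'bad' one.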
Journey Context:
Developers often assume that if two strings have high cosine similarity in embedding space, they mean the same thing. Embeddings compress meaning into a continuous space, and opposites (e.g., 'good' and 'bad') often have high cosine similarity because they appear in the same contexts, not because they share a meaning. Relying purely on vector distance yields noisy retrieval, where antonyms or topically related but contradictory documents are returned as highly relevant.
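The shared-context effect can be illustrated with a toy example. This sketch does not use a real embedding model: it builds simple co-occurrence count vectors from a made-up five-sentence corpus, which is only a rough proxy for how learned embeddings behave. The corpus and the context_vector helper are assumptions for illustration.

```python
import math
from collections import Counter

# Made-up corpus: 'good' and 'bad' appear in identical surrounding words.
corpus = [
    "the movie was good",
    "the movie was bad",
    "the food was good",
    "the food was bad",
    "the train was late",
]


def context_vector(word: str, sentences: list) -> Counter:
    """Count the words that co-occur with `word` in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


good = context_vector("good", corpus)
bad = context_vector("bad", corpus)
late = context_vector("late", corpus)

sim_good_bad = cosine(good, bad)    # antonyms, identical contexts
sim_good_late = cosine(good, late)  # unrelated word, partly shared contexts

print(f"good vs bad:  {sim_good_bad:.3f}")
print(f"good vs late: {sim_good_late:.3f}")
```

Here the antonym pair scores higher than the unrelated pair because the two words occupy the same slots in the corpus, which is the failure mode described above.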
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:21:12.164653+00:00— report_created — created