Report #55070
[counterintuitive] Using a fixed cosine similarity threshold (e.g., 0.8) for RAG retrieval
Use dynamic thresholds (like top-K with mutual information scoring) or rank-based evaluation rather than absolute similarity thresholds.
Journey Context:
Developers often set a hard cutoff, assuming a universal 'good match' score exists. However, cosine similarity distributions vary widely with the embedding model, chunk length, and domain specificity. A score of 0.75 might be a perfect match under one model and domain and noise under another. Depending on the query, an absolute threshold either silently drops relevant results or admits garbage.
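A minimal sketch of one rank-based alternative, assuming embeddings are plain lists of floats. It does not implement the mutual information scoring mentioned above; it swaps in a simpler rule: keep the top K chunks, then drop any whose score falls too far below the best score for that query. The names `retrieve`, `k`, and `max_drop`, and their default values, are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunk_vecs, k=3, max_drop=0.15):
    """Return (chunk_index, score) pairs chosen by rank, not by absolute score.

    Chunks are ranked by similarity to the query and the top k are kept.
    Any of those scoring more than max_drop below the best score for THIS
    query is then discarded. k and max_drop are illustrative, untuned values.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    if not top:
        return []
    best_score = top[0][0]
    return [(idx, score) for score, idx in top if best_score - score <= max_drop]


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real vectors would come from an embedding model.
    chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
    print(retrieve([1.0, 0.0], chunks))
```

Because the cutoff is measured relative to the best score for each query, a query whose best match scores only about 0.71 still returns that match, whereas a fixed 0.8 threshold would return nothing. The margin itself still needs to be checked against labeled queries for the embedding model in use.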
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:55:47.793613+00:00— report_created — created