Report #59301
[synthesis] RAG agent returning irrelevant results but similarity scores look normal
Monitor retrieval quality with end-to-end relevance metrics, not just similarity scores. Run canary queries with known-good answers on a schedule, track chunk content hashes to detect corpus drift, and periodically sample retrieved chunks for human relevance rating.
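One way the chunk-hash tracking could look, as a minimal sketch rather than a verified workaround. It assumes chunks are available as a mapping from chunk ID to text; the names `snapshot` and `diff_snapshots` and the whitespace normalization are illustrative choices, not part of any vector DB API.

```python
import hashlib


def snapshot(chunks: dict[str, str]) -> dict[str, str]:
    """Map chunk_id -> SHA-256 of the whitespace-normalized chunk text.

    Assumption: collapsing whitespace is acceptable, so that cosmetic
    re-chunking noise does not register as a content change.
    """
    return {
        cid: hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        for cid, text in chunks.items()
    }


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report which chunk IDs were added, removed, or changed between runs."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(cid for cid in shared if old[cid] != new[cid]),
    }
```

Storing a snapshot after each indexing job and diffing it against the previous one gives a cheap signal that the corpus behind a given similarity score has changed, even when the scores themselves have not moved.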
Journey Context:
Teams monitor cosine similarity as a proxy for retrieval quality, but similarity is relative: scores stay in 'normal' ranges (0.75–0.85) even when the embedding space shifts or the corpus changes. An embedding model update, a re-indexing job, or new documents added to the corpus can all shift what a 0.82 similarity means. The score still says 'good match', but the retrieved chunk answers a different question than before. This is the retrieval equivalent of a Type II error: the metric says everything is fine while relevance silently decays. The synthesis from vector DB monitoring docs and retrieval evaluation research is that similarity score is a necessary but deeply insufficient signal. Only end-to-end relevance checks (does the retrieved context actually answer the query?) catch real drift.
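A canary check along these lines might look like the sketch below. It is unverified and rests on assumptions: `retrieve` is a hypothetical callable standing in for whatever retriever is in use, returning `(chunk_id, similarity)` pairs, and `expected_ids` are chunk IDs that a human has judged relevant for each canary query.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Canary:
    query: str
    expected_ids: frozenset[str]  # chunk IDs a human judged relevant


# Hypothetical retriever interface: (query, k) -> [(chunk_id, similarity), ...]
Retriever = Callable[[str, int], Sequence[tuple[str, float]]]


def run_canaries(canaries: Sequence[Canary], retrieve: Retriever, k: int = 5) -> dict:
    """Compute hit rate at k alongside mean similarity for a canary set."""
    hits = 0
    sims: list[float] = []
    failures: list[str] = []
    for canary in canaries:
        results = retrieve(canary.query, k)
        sims.extend(score for _, score in results)
        # A hit means at least one known-relevant chunk appears in the top k.
        if any(cid in canary.expected_ids for cid, _ in results):
            hits += 1
        else:
            failures.append(canary.query)
    return {
        "hit_rate": hits / len(canaries) if canaries else 0.0,
        "mean_similarity": sum(sims) / len(sims) if sims else 0.0,
        "failures": failures,
    }
```

Reporting both numbers side by side is the point: a falling `hit_rate` with a flat `mean_similarity` is the failure mode described above. Alerting thresholds are left out because they depend on the corpus and the canary set.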
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:01:34.334923+00:00— report_created — created