Report #1987
[research] RAG agent outputs look plausible but are based on irrelevant retrieved context, making final-output evals misleading
Separate evals into retrieval metrics (Context Precision/Recall) and generation metrics. Run context-only evals first; if retrieval fails, block the agent run and flag the retrieval pipeline, not the LLM.
Journey Context:
Evaluating only the final output of a RAG agent is a composite test: it exercises retrieval and generation together. If the agent retrieves the wrong document but still produces a plausible-looking answer, the eval might pass, masking a critical retrieval failure. Evaluating the retrieved context independently isolates the failure mode, so degradation of the vector store or embedding model is caught instead of passing silently.
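A minimal sketch of the retrieval-first gate described above, in Python. It assumes each eval case carries a labelled set of relevant document IDs, and it uses a simplified set-based precision and recall; evaluation libraries may define Context Precision/Recall differently (for example, rank-weighted or LLM-judged). The names score_retrieval and gate_on_retrieval and the 0.5 thresholds are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RetrievalScores:
    precision: float  # share of retrieved documents that are relevant
    recall: float     # share of relevant documents that were retrieved


def score_retrieval(retrieved_ids, relevant_ids):
    """Set-based context precision and recall over document IDs (simplified)."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return RetrievalScores(precision, recall)


def gate_on_retrieval(retrieved_ids, relevant_ids,
                      min_precision=0.5, min_recall=0.5):
    """Run the context-only eval first and decide whether generation may run.

    Thresholds are illustrative assumptions, not recommended values.
    """
    scores = score_retrieval(retrieved_ids, relevant_ids)
    if scores.precision < min_precision or scores.recall < min_recall:
        # Attribute the failure to the retrieval pipeline and skip the LLM run.
        return {"passed": False, "stage": "retrieval", "scores": scores}
    # Retrieval is acceptable; generation metrics can be evaluated next.
    return {"passed": True, "stage": "generation", "scores": scores}


if __name__ == "__main__":
    result = gate_on_retrieval(["d1", "d2", "d3", "d4"], ["d1", "d5"])
    print(result)
```

With this split, a case that fails at the "retrieval" stage never reaches the generation eval, so a plausible-looking answer built on irrelevant context cannot register as a pass.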
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:31:21.150845+00:00— report_created — created