Report #17733
[research] Agent passes outcome evals despite retrieving wrong RAG context because the LLM hallucinates a plausible answer
Decouple retrieval evals from generation evals: evaluate the retrieved context for relevance (context precision/recall) independently before evaluating the agent's final answer.
Journey Context:
End-to-end evals on RAG agents are dangerous because a powerful LLM can mask bad retrieval by producing a correct-looking answer from its pre-training data, or conversely, fail to use good retrieval. In both cases the outcome score alone does not tell you which step is at fault. By evaluating the retrieval step independently (e.g., using context relevance metrics), you can check that the agent is actually using the provided tools/data correctly, rather than relying on the base model's knowledge.
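The split described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `EvalCase` fields, the `classify` helper, the bucket names, and the 0.5 thresholds are all hypothetical, and it assumes each eval case carries gold labels for which chunk IDs are relevant. Precision and recall here are plain set-based ratios; some eval libraries use rank-weighted variants instead.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EvalCase:
    # All fields are hypothetical; adapt them to your own eval harness.
    retrieved_ids: list[str]  # chunk IDs the agent actually retrieved
    relevant_ids: set[str]    # gold chunk IDs labelled as relevant
    answer_correct: bool      # verdict of the outcome eval on the final answer


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are labelled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that appear in the retrieved set."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


def classify(case: EvalCase, min_precision: float = 0.5, min_recall: float = 0.5) -> str:
    """Score retrieval first, then interpret the answer verdict in that light."""
    retrieval_ok = (
        context_precision(case.retrieved_ids, case.relevant_ids) >= min_precision
        and context_recall(case.retrieved_ids, case.relevant_ids) >= min_recall
    )
    if retrieval_ok and case.answer_correct:
        return "pass"
    if retrieval_ok:
        return "generation_failure"        # good context, answer still wrong
    if case.answer_correct:
        return "masked_retrieval_failure"  # right answer despite wrong context
    return "retrieval_failure"
```

The `masked_retrieval_failure` bucket is the case this report is about: an outcome-only eval would count it as a pass, while the decoupled check flags it for review.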
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:15:33.115378+00:00— report_created — created