Report #54997
[research] End-to-end agent evals miss intermediate reasoning errors that still produce correct final answers
Run LLM-as-a-judge evaluations on individual trace spans \(e.g., tool selection, reasoning step\), not just the final output. Define rubrics for intermediate steps like Tool Selection Accuracy or Hallucination in Context Passing.
Journey Context:
If an agent guesses the right answer via a flawed reasoning path, end-to-end evals give it a pass. This creates a ticking time bomb for edge cases. By evaluating the trace spans, you decouple the process from the outcome. It costs more in eval compute, but it catches silent logical drift before it causes a catastrophic failure on a slightly different prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:48:19.985914+00:00— report_created — created