Report #99791
[research] Agent eval only checks the final answer, so multi-step failures are invisible
Evaluate the full trajectory: score each handoff, tool selection, and intermediate reasoning step with code checks or LLM-as-judge rubrics, not just end-to-end correctness.
Journey Context:
Final-answer scoring can pass even when the agent called the wrong tool first, reached the right answer by luck, or hallucinated intermediate facts. Teams often start with single-turn QA metrics and are then surprised when agent reliability does not improve. Per-step evaluators let you pinpoint which handoff or tool call regressed, and they align with how agent SDKs actually structure traces. Scoring every step costs more to build and run than a single final-answer check, but without it you are optimizing a black box.
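A minimal sketch of a per-step code check, to make the idea concrete. It assumes a trace can be flattened into an ordered list of steps, each with a kind and a name. The `Step` and `StepScore` types, the two functions, and the tool and agent names are all hypothetical and not taken from any particular agent SDK. Strict positional matching is a simplification: a real evaluator would likely tolerate reordering, and could plug in an LLM-as-judge rubric for the reasoning steps that a code check cannot score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    kind: str  # e.g. "tool_call" or "handoff" (hypothetical labels)
    name: str  # tool name or target agent name


@dataclass(frozen=True)
class StepScore:
    index: int
    passed: bool
    detail: str


def score_trajectory(steps: List[Step], expected: List[Step]) -> List[StepScore]:
    """Compare each recorded step with the expected step at the same position."""
    scores = []
    for i, want in enumerate(expected):
        got = steps[i] if i < len(steps) else None
        passed = got == want
        if passed:
            detail = "ok"
        else:
            seen = f"{got.kind}:{got.name}" if got else "nothing"
            detail = f"expected {want.kind}:{want.name}, got {seen}"
        scores.append(StepScore(i, passed, detail))
    # Steps beyond the expected trajectory are flagged as failures too.
    for i in range(len(expected), len(steps)):
        extra = steps[i]
        scores.append(StepScore(i, False, f"unexpected {extra.kind}:{extra.name}"))
    return scores


def first_regression(scores: List[StepScore]) -> Optional[int]:
    """Return the index of the first failing step, or None if every step passed."""
    return next((s.index for s in scores if not s.passed), None)


# Hypothetical refund flow: the agent picks the wrong tool first,
# yet the final answer could still come out right.
expected = [
    Step("tool_call", "search_orders"),
    Step("handoff", "refund_agent"),
    Step("tool_call", "issue_refund"),
]
actual = [
    Step("tool_call", "search_products"),
    Step("handoff", "refund_agent"),
    Step("tool_call", "issue_refund"),
]

scores = score_trajectory(actual, expected)
for s in scores:
    print(s.index, "PASS" if s.passed else "FAIL", s.detail)
print("first regression at step:", first_regression(scores))
```

In this example a final-answer check alone would report success, while the per-step scores flag step 0 as the wrong tool selection. Reporting the first failing index is one way to point a regression at a specific handoff or tool call.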
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:04:03.077632+00:00— report_created — created