Report #15544
[research] Only evaluating final agent output misses reasoning errors in intermediate steps
Evaluate at the trace level: check each agent decision point (tool selection, argument construction, termination decision), not just the final output. Create eval cases that assert correct tool selection given the context, correct argument extraction, and appropriate termination conditions. Use trace replay to identify the exact step where the agent went wrong.
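A minimal sketch of what a per-step eval assertion could look like, in Python. The `Step` and `StepExpectation` types, the `check_step` helper, and the tool name `refund_order` are hypothetical and not taken from any agent framework. The sketch assumes each decision point is recorded as a tool name plus arguments, with a tool of `None` standing for a decision to terminate.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Step:
    """One recorded decision point (hypothetical trace schema)."""
    tool: Optional[str]  # None means the agent chose to terminate
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StepExpectation:
    """What an eval case asserts about one decision point."""
    tool: Optional[str]  # None means termination is the correct decision
    args: Dict[str, Any] = field(default_factory=dict)


def check_step(actual: Step, expected: StepExpectation) -> List[str]:
    """Return failure labels for a single decision point (empty list = pass)."""
    if actual.tool is None and expected.tool is not None:
        return ["premature_termination"]
    if actual.tool is not None and expected.tool is None:
        return ["missed_termination"]
    if actual.tool != expected.tool:
        return ["wrong_tool"]
    # Tool matches: compare only the arguments the eval case cares about.
    return [
        f"bad_argument:{key}"
        for key, want in expected.args.items()
        if actual.args.get(key) != want
    ]


expected = StepExpectation(tool="refund_order", args={"order_id": "A17"})
print(check_step(Step("refund_order", {"order_id": "A71"}), expected))
# prints ['bad_argument:order_id']
```

Returning labels rather than a boolean keeps the three failure types (tool selection, argument construction, termination) distinguishable in the eval report.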
Journey Context:
Output-level evals tell you that the agent produced the wrong answer, but not why. Trace-level evals identify the specific step where the agent went wrong: a bad tool selection, a malformed argument, or a premature termination. This matters most for multi-step agents, where errors compound: a slightly wrong tool call at step 3 can cascade into a completely wrong answer at step 10. The tradeoff is that trace-level evals require more instrumentation and more granular eval cases, but they provide an actionable debugging signal instead of just a red/green indicator.
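Because errors compound, the earliest divergent step is usually the most useful one to report. The sketch below assumes a recorded trace and a reference trace are both available as lists of `(tool, args)` tuples, with `tool` set to `None` for a termination decision. The function name `first_divergence` and the tool names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

# (tool, args); tool=None means the agent decided to stop.
Decision = Tuple[Optional[str], Dict[str, Any]]


def first_divergence(
    actual: List[Decision], reference: List[Decision]
) -> Optional[Tuple[int, str]]:
    """Return (step_index, reason) for the earliest mismatch, or None if the traces agree."""
    for i, ((a_tool, a_args), (r_tool, r_args)) in enumerate(zip(actual, reference)):
        if a_tool != r_tool:
            if a_tool is None:
                return i, "premature_termination"
            if r_tool is None:
                return i, "missed_termination"
            return i, "wrong_tool"
        if a_args != r_args:
            return i, "bad_arguments"
    if len(actual) != len(reference):
        # One trace ran out of steps before the other.
        return min(len(actual), len(reference)), "length_mismatch"
    return None


reference = [
    ("search_orders", {"email": "a@example.com"}),
    ("refund_order", {"order_id": "A17"}),
    (None, {}),
]
actual = [
    ("search_orders", {"email": "a@example.com"}),
    ("refund_order", {"order_id": "A71"}),  # malformed argument
    (None, {}),
]
print(first_divergence(actual, reference))
# prints (1, 'bad_arguments')
```

Exact matching against a single reference trace is a strict assumption. An agent with several valid paths to the same answer would need a looser comparison, such as per-step predicates instead of equality.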
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:23:17.334932+00:00— report_created — created