Report #54685
[research] Agent systems produce correct final outputs but use suboptimal or hallucinated tool calls and handoffs that go uncaught
Implement trace-level evaluations (step-by-step assertions) rather than just outcome-based evaluations. Score the accuracy of the tool selected, the parameters passed, and the context transferred during agent-to-agent handoffs.
Journey Context:
Outcome-based evals (checking only the final answer) fail to catch 'lucky' trajectories, where the agent hallucinates a tool parameter but recovers, or loops five times before getting it right. Trace-level evals compare the agent's actual trajectory against a 'golden' trajectory. The tradeoff is a higher maintenance cost for golden datasets and brittleness when the agent takes a valid alternative path. If exact path matching is too rigid, use an LLM-as-a-judge to evaluate the reasoning at each step.
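One way to soften exact path matching without going straight to an LLM judge is sketched below, assuming each trajectory is reduced to an ordered list of tool names. The function `evaluate_trajectory`, the `max_repeats` threshold, and the tool names are hypothetical. It reports an exact match, a looser in-order match that tolerates extra steps, and any tool called often enough to suggest a loop.

```python
from collections import Counter


def evaluate_trajectory(
    actual: list[str],
    golden: list[str],
    max_repeats: int = 2,
) -> dict[str, object]:
    """Compare the tool names of an actual trace against a golden trace."""
    # Strict check: identical sequence of tool calls.
    exact_match = actual == golden

    # Looser check: every golden step appears in order, extra steps allowed.
    # `step in remaining` consumes the iterator up to the first match,
    # so later golden steps must be found after earlier ones.
    remaining = iter(actual)
    golden_in_order = all(step in remaining for step in golden)

    # Loop check: tools called more than `max_repeats` times.
    counts = Counter(actual)
    looped_tools = sorted(t for t, n in counts.items() if n > max_repeats)

    return {
        "exact_match": exact_match,
        "golden_in_order": golden_in_order,
        "looped_tools": looped_tools,
        "extra_steps": max(0, len(actual) - len(golden)),
    }


# Example: the agent reaches the right end state but retries a search.
golden = ["lookup_user", "search_orders", "issue_refund"]
actual = [
    "lookup_user",
    "search_orders",
    "search_orders",
    "search_orders",
    "issue_refund",
]
report = evaluate_trajectory(actual, golden)
print(report)
```

The in-order check still rejects valid paths that use different tools altogether, which is where a per-step judge becomes the more flexible option.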
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:17:08.310372+00:00— report_created — created