Report #5109
[research] Agent passes final output eval but uses excessive tool calls or incorrect handoffs
Implement trace-level evals that score the agent's trajectory, not just the outcome. Either define a 'golden trajectory' to compare against, or use an LLM judge with a rubric, and penalize unnecessary tool calls, self-corrections, and invalid handoffs between sub-agents.
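A minimal sketch of the golden-trajectory variant, in Python. Everything here is an assumption for illustration and not part of any specific eval framework: the `Step` record, the `score_trajectory` function, the `allowed_handoffs` set, and the penalty weights are all hypothetical. The actual trace is aligned to the golden one with a longest-common-subsequence match, so extra or missing steps and disallowed handoffs lower the score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One trace event (hypothetical schema, adapt to your trace format)."""
    kind: str         # "tool" or "handoff"
    name: str         # tool name, or target sub-agent for a handoff
    source: str = ""  # agent that issued the step


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score_trajectory(actual, golden, allowed_handoffs):
    """Score a trace against a golden trajectory; 1.0 means an exact match.

    allowed_handoffs is a set of (source_agent, target_agent) pairs.
    The weights below are arbitrary assumptions, not tuned values.
    """
    a = [(s.kind, s.name) for s in actual]
    g = [(s.kind, s.name) for s in golden]
    matched = lcs_length(a, g)
    extra_steps = len(a) - matched    # steps the golden path does not need
    missing_steps = len(g) - matched  # golden steps the agent skipped
    invalid = [
        s for s in actual
        if s.kind == "handoff" and (s.source, s.name) not in allowed_handoffs
    ]
    penalty = extra_steps + missing_steps + 2 * len(invalid)
    score = max(0.0, 1.0 - penalty / max(len(g), 1))
    return {
        "score": score,
        "extra_steps": extra_steps,
        "missing_steps": missing_steps,
        "invalid_handoffs": len(invalid),
    }
```

With a three-step golden path, a trace that repeats one tool call once would report `extra_steps` of 1 and a score of about 0.67 under these weights. An LLM judge could replace the LCS alignment where no single golden path exists; that variant is not shown here.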
Journey Context:
Outcome-based evals (which check only the final answer) fail to catch inefficient or fragile paths. An agent might loop five times before reaching the right answer: the run passes an outcome eval but fails in production on latency and cost. Trajectory evals check that the agent takes the right path, though they require more setup to define the expected steps or the rubric for the judge.
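The looping case above can be sketched as a check that reports the outcome result and the trajectory result separately. This is an illustrative Python sketch under stated assumptions: `evaluate_run`, the `(tool_name, args)` tuple format, and the `max_calls` and `max_repeats` thresholds are hypothetical and would need to match your own trace format and budgets.

```python
from collections import Counter


def evaluate_run(final_answer, expected_answer, tool_calls,
                 max_calls=3, max_repeats=1):
    """Report outcome and trajectory verdicts separately.

    tool_calls is a list of hashable (tool_name, args) tuples.
    max_calls and max_repeats are assumed budgets, not recommended values.
    """
    outcome_pass = final_answer == expected_answer
    counts = Counter(tool_calls)
    # Identical calls issued more often than allowed suggest a retry loop.
    repeated = {call: n for call, n in counts.items() if n > max_repeats}
    within_budget = len(tool_calls) <= max_calls
    return {
        "outcome_pass": outcome_pass,
        "trajectory_pass": within_budget and not repeated,
        "total_calls": len(tool_calls),
        "repeated_calls": repeated,
    }


# A run that reaches the right answer only after five identical calls.
looping_run = evaluate_run(
    final_answer="30 days",
    expected_answer="30 days",
    tool_calls=[("search", "q=refund policy")] * 5,
)
```

Here `looping_run["outcome_pass"]` is true while `looping_run["trajectory_pass"]` is false, which is the gap an outcome-only eval would miss.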
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:40:37.504870+00:00— report_created — created