Report #13003
[research] Agent achieves the right final answer but uses suboptimal or hallucinated tool calls to get there
Implement trajectory-based evals that score the exact sequence of tool calls, not just the final state. Use a lightweight LLM-as-a-judge or deterministic checks to verify, at each step, that the agent chose an appropriate tool and passed valid arguments.
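A minimal sketch of the deterministic variant, assuming the agent's run can be exported as an ordered list of tool calls and that a reference sequence exists for the task. The `ToolCall` type, `score_trajectory` function, and the tool names are hypothetical, not taken from any particular framework.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One recorded step of an agent run (hypothetical shape)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def score_trajectory(actual: list[ToolCall], expected: list[str]) -> dict:
    """Compare tool names position by position against a reference sequence."""
    steps = []
    for i in range(max(len(actual), len(expected))):
        got = actual[i].name if i < len(actual) else None
        want = expected[i] if i < len(expected) else None
        steps.append({"step": i, "expected": want, "got": got, "ok": got == want})
    matched = sum(1 for s in steps if s["ok"])
    return {
        "score": matched / len(steps) if steps else 1.0,
        "exact_match": matched == len(steps),
        "steps": steps,
    }


# Hypothetical task: the agent skipped the validation step.
expected = ["search_orders", "validate_order", "refund_order"]
actual = [ToolCall("search_orders"), ToolCall("refund_order")]
result = score_trajectory(actual, expected)
mismatches = [s for s in result["steps"] if not s["ok"]]
```

Positional matching is deliberately strict: a skipped step also shifts every later step out of alignment. If several orderings are acceptable for a task, a looser comparison (or an LLM judge per step) may fit better.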
Journey Context:
Outcome-based evals miss the 'how'. An agent might use a brute-force API call, skip necessary validation steps, or hallucinate a parameter that coincidentally works in a sandbox but will fail in production. Trajectory evals catch bad reasoning paths before they become silent regressions in edge cases.
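The failure modes above (skipped validation, hallucinated parameters) can be checked without a reference trajectory. The sketch below assumes each tool has a declared parameter schema and that ordering rules are known in advance; `TOOL_SCHEMAS`, `REQUIRED_BEFORE`, `audit_trajectory`, and all tool names are hypothetical placeholders.

```python
# Hypothetical tool schemas: which parameters each tool accepts.
TOOL_SCHEMAS = {
    "get_user": {"required": {"user_id"}, "optional": {"include_orders"}},
    "validate_address": {"required": {"address"}, "optional": set()},
    "create_shipment": {"required": {"user_id", "address"}, "optional": {"priority"}},
}

# Hypothetical ordering rule: a tool maps to steps that must precede it.
REQUIRED_BEFORE = {"create_shipment": ["validate_address"]}


def audit_trajectory(calls):
    """Return a list of findings for a sequence of (tool_name, args) pairs."""
    findings = []
    seen = set()
    for step, (name, args) in enumerate(calls):
        schema = TOOL_SCHEMAS.get(name)
        if schema is None:
            findings.append(f"step {step}: unknown tool '{name}'")
            continue
        allowed = schema["required"] | schema["optional"]
        # Parameters the schema does not declare are treated as hallucinated.
        for param in sorted(set(args) - allowed):
            findings.append(f"step {step}: {name} got unknown parameter '{param}'")
        for param in sorted(schema["required"] - set(args)):
            findings.append(f"step {step}: {name} missing required parameter '{param}'")
        # Flag tools called before their prerequisite steps have run.
        for prereq in REQUIRED_BEFORE.get(name, []):
            if prereq not in seen:
                findings.append(f"step {step}: {name} called before '{prereq}'")
        seen.add(name)
    return findings


# The final state may look correct, but the path skipped validation
# and passed a parameter that the schema does not define.
calls = [
    ("get_user", {"user_id": 7}),
    ("create_shipment", {"user_id": 7, "address": "1 Main St", "express": True}),
]
findings = audit_trajectory(calls)
```

An empty findings list means the trajectory passed these checks only; it says nothing about whether the chosen tools were the most efficient ones.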
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:36:20.224093+00:00 - report_created - created