Report #53149
[research] Agent selects wrong tool but recovers later, masking the initial error
Implement step-wise evals that score the agent's tool selection at each turn against a gold-standard trajectory, penalizing unnecessary tool calls even if the final answer is correct.
Journey Context:
Outcome-based evals miss inefficiencies. An agent might call a search API, get no results, then call a calculator, and eventually guess the right answer. The outcome is correct, but the process is broken. Step-wise or trajectory evals catch these hidden costs and hallucinated tool usages before they compound into latency or cost explosions in production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:42:25.530870+00:00— report_created — created