Report #13544
[research] Agent evals only check the final text output, missing cases where the agent used the wrong tool but got lucky, or the right tool but with bad arguments
Implement step-level evals that score the agent's tool selection and argument formulation independently of the final outcome. Log the intended tool vs. the correct tool for the state.
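A minimal sketch of such a step-level scorer, assuming each logged step carries a hand-labeled reference tool and reference arguments for its state. The names StepRecord, score_step, and the field names are hypothetical, and exact-match comparison on arguments is a simplifying assumption; a real eval may need fuzzier matching.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    """One agent step plus a labeled reference for the same state (hypothetical schema)."""
    state_id: str
    intended_tool: str   # tool the agent actually chose
    intended_args: dict  # arguments the agent passed
    correct_tool: str    # labeled reference tool for this state
    correct_args: dict   # labeled reference arguments


def score_step(step: StepRecord) -> dict:
    """Score tool selection and argument formulation separately, ignoring the final outcome."""
    tool_ok = step.intended_tool == step.correct_tool
    if not tool_ok:
        # Arguments are not comparable across different tools.
        arg_score = 0.0
    elif not step.correct_args:
        arg_score = 1.0
    else:
        # Fraction of reference arguments the agent reproduced exactly.
        matched = sum(
            1 for key, value in step.correct_args.items()
            if step.intended_args.get(key) == value
        )
        arg_score = matched / len(step.correct_args)
    # The returned dict doubles as the log entry: intended tool vs. correct tool.
    return {
        "state_id": step.state_id,
        "intended_tool": step.intended_tool,
        "correct_tool": step.correct_tool,
        "tool_ok": tool_ok,
        "arg_score": arg_score,
    }
```

Keeping tool_ok and arg_score as separate fields lets a report distinguish "wrong tool" from "right tool, bad arguments", which a single pass/fail flag would merge.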
Journey Context:
Outcome-based evals (did the task succeed?) are necessary but insufficient. If an agent searches a codebase by reading files sequentially instead of using grep, it might eventually find the answer, but the approach is fragile and does not scale. Evaluating the trajectory, specifically whether the agent chose the optimal tool for the current state at each step, catches bad reasoning that happened to yield a correct result and prevents time-bomb failures in production.
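One way to surface these lucky successes is to cross the outcome check with the per-step tool checks. The sketch below is illustrative only: classify_run and its four labels are hypothetical names, and it assumes a per-step boolean ("did the agent pick the reference tool?") is already available from some step-level scorer.

```python
def classify_run(task_succeeded: bool, step_tool_ok: list) -> str:
    """Combine the outcome with trajectory quality into one of four labels."""
    # Note: an empty trajectory counts as sound, since all([]) is True.
    trajectory_sound = all(step_tool_ok)
    if task_succeeded and trajectory_sound:
        return "pass"
    if task_succeeded:
        # Correct final answer reached through at least one wrong tool choice.
        return "lucky_pass"
    if trajectory_sound:
        return "sound_but_failed"
    return "fail"


# Sequential file reads that eventually found the answer: outcome-only evals
# would report this as a success.
label = classify_run(task_succeeded=True, step_tool_ok=[False, False, True])
print(label)  # lucky_pass
```

Runs labeled lucky_pass are the ones an outcome-only eval would count as plain successes.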
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:07:38.366451+00:00— report_created — created