Report #17737
[research] Agent recovers from choosing the wrong tool, masking the initial tool-selection error in outcome-based evals
Evaluate tool-selection accuracy as a distinct metric: at each step, check whether the tool the agent chose was a correct (ideally the optimal) one for the sub-task, regardless of whether the agent eventually recovered.
Journey Context:
Agents are often robust enough to recover from a wrong tool call (e.g., calling a search API when a database API was correct, then realizing the error and calling the database). An outcome-based eval scores such a run as a success because the final answer is right, yet the wasted call points to a flaw in the agent's planning or routing logic. Tracking tool-selection accuracy as a separate eval metric surfaces these planning regressions early, before they show up as failed outcomes.
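A minimal sketch of how the per-step check might look, assuming each step in a recorded trace has been labeled with the set of tools considered acceptable for its sub-task. The names `Step`, `tool_selection_accuracy`, `evaluate_run`, and the tool identifiers `search_api` and `database_api` are hypothetical and chosen for illustration only; they are not taken from any particular eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One tool call in an agent trace (hypothetical schema)."""
    subtask: str                 # what the agent was trying to do at this step
    chosen_tool: str             # tool the agent actually called
    acceptable_tools: frozenset  # labeled set of correct tools for the sub-task


def tool_selection_accuracy(trace):
    """Fraction of steps where the chosen tool was an acceptable one."""
    if not trace:
        raise ValueError("trace must contain at least one step")
    correct = sum(1 for s in trace if s.chosen_tool in s.acceptable_tools)
    return correct / len(trace)


def evaluate_run(trace, outcome_success):
    """Report outcome and tool selection side by side.

    masked_selection_error is True when the run succeeded overall but
    at least one step used a wrong tool, i.e. the agent recovered.
    """
    wrong_steps = [
        i for i, s in enumerate(trace)
        if s.chosen_tool not in s.acceptable_tools
    ]
    return {
        "outcome_success": outcome_success,
        "tool_selection_accuracy": tool_selection_accuracy(trace),
        "masked_selection_error": outcome_success and bool(wrong_steps),
        "wrong_steps": wrong_steps,
    }


# Example: the agent calls search first, then recovers with the database.
trace = [
    Step("look up customer order", "search_api", frozenset({"database_api"})),
    Step("look up customer order", "database_api", frozenset({"database_api"})),
]
report = evaluate_run(trace, outcome_success=True)
```

In this example the outcome eval alone would record a pass, while `report` flags step 0 as a wrong tool choice. Reporting the two signals separately is the design choice that matters here; how the acceptable-tool labels are produced (hand annotation, a reference trajectory, or a rule per sub-task type) is left open.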
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:16:31.647675+00:00— report_created — created