Report #52493
[research] Agent selects wrong tools or calls tools with malformed parameters — end-to-end evals miss tool-selection failures
Add tool-selection evals as a separate eval dimension that checks four things: (1) given a task description, whether the agent selects the correct tool(s); (2) whether tool parameters are well-formed (schema-valid); (3) whether tool parameters are semantically correct (the right values, not merely schema-valid); and (4) whether the agent recovers gracefully from tool errors (retry, fallback, or informing the user). Score tool selection independently from task completion to isolate reasoning failures from execution failures.
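A minimal sketch of how the four checks could be scored independently is shown here. All names (`ToolCall`, `EvalCase`, `score_tool_call`) and the tiny "param name to Python type" schema are hypothetical stand-ins, not part of any specific eval framework; a real setup would likely validate against the tool's actual JSON Schema and record error recovery from the run trace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    """One tool call emitted by the agent (hypothetical structure)."""
    name: str
    args: dict


@dataclass
class EvalCase:
    """Ground truth for one task (hypothetical structure)."""
    expected_tool: str
    expected_args: dict
    # Assumed mini-schema: parameter name -> required Python type.
    required_params: dict


def score_tool_call(case: EvalCase, call: ToolCall,
                    recovered_from_error: Optional[bool] = None) -> dict:
    """Score each dimension separately; None means 'not applicable'."""
    selected = call.name == case.expected_tool

    schema_valid = None
    semantically_correct = None
    if selected:
        # Parameter checks only make sense once the right tool was chosen.
        schema_valid = (
            set(call.args) == set(case.required_params)
            and all(isinstance(call.args[k], t)
                    for k, t in case.required_params.items())
        )
        semantically_correct = schema_valid and call.args == case.expected_args

    return {
        "tool_selection": selected,
        "schema_valid": schema_valid,
        "semantically_correct": semantically_correct,
        # Supplied by the harness only for cases that inject a tool error.
        "error_recovery": recovered_from_error,
    }


case = EvalCase(
    expected_tool="get_weather",
    expected_args={"city": "Paris", "unit": "celsius"},
    required_params={"city": str, "unit": str},
)
# Right tool, schema-valid arguments, but the wrong unit value.
result = score_tool_call(case, ToolCall("get_weather", {"city": "Paris", "unit": "kelvin"}))
print(result)
```

In this example the call passes tool selection and schema validation but fails the semantic check, which is exactly the kind of failure an end-to-end pass/fail score would not localize.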
Journey Context:
Most evals measure end-to-end task success, conflating multiple failure modes. An agent can fail because it planned wrong (reasoning failure) or because it called the right tool with wrong arguments (execution failure). These require different fixes: reasoning failures need prompt engineering or model upgrades; execution failures need better tool descriptions, schema validation, or parameter examples. Isolating tool-selection accuracy as its own eval dimension enables targeted diagnosis. AgentBench evaluates tool use as a first-class dimension for this reason.
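As a rough illustration of the diagnosis step, failed runs can be bucketed by failure mode so that each bucket points at its own class of fix. This sketch treats "wrong tool chosen" as a proxy for a reasoning failure, which is an assumption for illustration; the record fields and function names are hypothetical.

```python
from collections import Counter

REASONING = "reasoning failure"   # fix via prompt engineering or model upgrade
EXECUTION = "execution failure"   # fix via tool descriptions, schemas, examples


def classify_failure(record: dict):
    """Label one run; returns None when the tool call matches expectations."""
    if record["selected_tool"] != record["expected_tool"]:
        # Assumption: picking the wrong tool reflects a planning problem.
        return REASONING
    if record["args"] != record["expected_args"]:
        # Right tool, wrong arguments.
        return EXECUTION
    return None


def failure_breakdown(records: list) -> dict:
    """Count failures per mode across a batch of eval runs."""
    counts = Counter()
    for record in records:
        label = classify_failure(record)
        if label is not None:
            counts[label] += 1
    return dict(counts)


runs = [
    {"selected_tool": "search_web", "expected_tool": "get_weather",
     "args": {"query": "Paris"}, "expected_args": {"city": "Paris"}},
    {"selected_tool": "get_weather", "expected_tool": "get_weather",
     "args": {"city": "paris, fr"}, "expected_args": {"city": "Paris"}},
    {"selected_tool": "get_weather", "expected_tool": "get_weather",
     "args": {"city": "Paris"}, "expected_args": {"city": "Paris"}},
]
breakdown = failure_breakdown(runs)
print(breakdown)
```

A breakdown dominated by execution failures would suggest working on tool descriptions and parameter examples before touching the prompt or the model.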
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:36:14.860214+00:00— report_created — created