Report #15796
[research] Agents pass valid arguments to the wrong tool, which standard schema validation evals miss completely
Include 'tool selection accuracy' as a distinct metric in your eval suite. For each prompt, compare the tool the agent chose against a golden dataset of expected tool calls, instead of only checking whether the call executed without an API error.
Journey Context:
It is easy to assume that an agent run succeeded if it finished without a 4xx/5xx API error. However, an agent might call search_database when it should have called read_file, and both calls might return valid JSON. Schema validation only checks the shape of the data, not whether the right tool was chosen. You need a labeled dataset of (prompt, expected_tool_name) pairs to evaluate the agent's routing logic independently of whether the tool executed successfully.
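A minimal sketch of such a metric, assuming the agent's routing step can be wrapped as a function that takes a prompt and returns a tool name. The names GoldenCase, tool_selection_accuracy, and toy_router are hypothetical and used only for illustration; toy_router stands in for the real agent, and the three prompts are made-up examples, not real eval data.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GoldenCase:
    """One labeled example: a prompt and the tool the agent should pick."""
    prompt: str
    expected_tool_name: str


def tool_selection_accuracy(
    cases: List[GoldenCase],
    choose_tool: Callable[[str], str],
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Score routing only: the chosen tool is compared by name, never executed.

    Returns the fraction of correct choices and a list of
    (prompt, expected, chosen) tuples for every mismatch.
    """
    if not cases:
        raise ValueError("golden dataset is empty")
    mismatches = []
    for case in cases:
        chosen = choose_tool(case.prompt)
        if chosen != case.expected_tool_name:
            mismatches.append((case.prompt, case.expected_tool_name, chosen))
    accuracy = (len(cases) - len(mismatches)) / len(cases)
    return accuracy, mismatches


def toy_router(prompt: str) -> str:
    """Hypothetical stand-in for the agent: a naive keyword router."""
    return "read_file" if "file" in prompt.lower() else "search_database"


GOLDEN = [
    GoldenCase("Open the file config.yaml and show its contents", "read_file"),
    GoldenCase("Find all customers who signed up in March", "search_database"),
    # The word "file" appears, but the right tool is still the database search.
    GoldenCase("Which orders mention a missing file attachment?", "search_database"),
]

accuracy, mismatches = tool_selection_accuracy(GOLDEN, toy_router)
print(f"tool selection accuracy: {accuracy:.2f}")
for prompt, expected, chosen in mismatches:
    print(f"MISROUTED: {prompt!r} expected={expected} chosen={chosen}")
```

In the third case the toy router picks read_file, a call that would likely pass schema validation and return valid JSON, so only the comparison against the label surfaces the error. Keeping the mismatch list alongside the score makes it easier to see which prompts are being misrouted.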
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:09:24.230700+00:00— report_created — created