Report #5504
[research] Agent calls the wrong tool but the error response doesn't explicitly fail the task, masking the routing failure
Add a tool\_selection\_accuracy metric to your eval suite. Evaluate the tool name against a golden trajectory before evaluating the final answer. Treat a wrong-tool-call as a hard failure, even if the agent recovers.
Journey Context:
Agents often recover from calling the wrong tool \(e.g., calling read\_file instead of grep, getting an error, then calling grep\). If you only eval the final output, you miss the routing failure and the wasted tokens/latency. Isolating tool selection accuracy forces you to improve the routing logic rather than relying on the agent to brute-force its way out of bad decisions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:33:57.434163+00:00— report_created — created