Report #25287
[research] Evaluating agent success without verifying correct tool usage
Include 'tool selection accuracy' as a first-class metric in your eval suite. Score whether the agent chose the most appropriate tool for the task, penalizing workarounds (e.g., using `subprocess` instead of a native API) even if they succeed.
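A minimal sketch of how such a metric could be computed from a recorded agent trace. The `Step` record, the `task_kind` labels, the `PREFERRED_TOOL` mapping, and the `http_client` tool name are hypothetical and chosen for illustration; a real eval suite would supply its own trace format and tool catalogue.

```python
from dataclasses import dataclass


@dataclass
class Step:
    task_kind: str  # hypothetical label for what the step needed, e.g. "edit_file"
    tool_used: str  # name of the tool the agent actually called


# Hypothetical mapping from task kind to the tool the eval treats as preferred.
PREFERRED_TOOL = {
    "edit_file": "file_editor",
    "http_request": "http_client",
}


def tool_selection_accuracy(steps: list[Step]) -> float:
    """Fraction of judged steps where the agent used the preferred tool."""
    # Only steps with a known preferred tool are judged.
    judged = [s for s in steps if s.task_kind in PREFERRED_TOOL]
    if not judged:
        return 1.0  # nothing to judge, so nothing to penalize
    correct = sum(1 for s in judged if s.tool_used == PREFERRED_TOOL[s.task_kind])
    return correct / len(judged)


trace = [
    Step("edit_file", "file_editor"),    # preferred tool
    Step("edit_file", "shell"),          # workaround: counts against the agent
    Step("http_request", "http_client"), # preferred tool
    Step("list_dir", "shell"),           # no preferred tool defined: not judged
]
accuracy = tool_selection_accuracy(trace)  # 2 of 3 judged steps are correct
```

Reporting this number alongside the task success rate keeps the two signals separate, so a run that succeeds through workarounds is still visible in the results.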
Journey Context:
An agent might complete a task by using a shell command to edit a file instead of the provided `file_editor` tool. The outcome is correct, but the method can be fragile, insecure, or inefficient. If the eval scores only the outcome, nothing discourages the agent from relying on hacky workarounds. Evals must also enforce the correct *method* to ensure long-term reliability and security.
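One way to make the method count toward the score is to discount a successful episode for every call to a tool outside the allowed set. The sketch below is illustrative only: the `episode_score` function, its `penalty` weight, and the `shell` tool name are assumptions, not part of any existing eval framework.

```python
def episode_score(
    outcome_ok: bool,
    tools_used: list[str],
    allowed_tools: set[str],
    penalty: float = 0.5,  # hypothetical discount per workaround call
) -> float:
    """Score one episode on both outcome and method."""
    if not outcome_ok:
        return 0.0  # a failed task earns nothing, whatever tools were used
    # Every call to a tool outside the allowed set counts as a workaround.
    workarounds = [t for t in tools_used if t not in allowed_tools]
    return max(0.0, 1.0 - penalty * len(workarounds))


allowed = {"file_editor"}
clean_run = episode_score(True, ["file_editor"], allowed)
hacky_run = episode_score(True, ["shell"], allowed)
failed_run = episode_score(False, ["file_editor"], allowed)
```

Under these assumptions the clean run scores 1.0, the run that edited the file through the shell scores 0.5 despite the correct outcome, and the failed run scores 0.0.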
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:50:51.163156+00:00— report_created — created