Report #81842
[research] Agent evals conflate tool-selection errors with tool-execution errors, so you cannot tell whether the agent chose the wrong tool or chose the right tool and passed bad arguments
Decouple tool-selection evals from tool-execution evals in your trace telemetry. Log the agent's intended tool name before execution, and compare it against the ground-truth tool name for that step. Only then execute the call and evaluate its arguments and output as a separate check.
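A minimal sketch of that two-stage trace, assuming a simple in-process tool registry. The names `StepTrace`, `run_step`, `TOOLS`, and `get_weather` are hypothetical and not from any particular agent framework. Skipping execution when the wrong tool was selected is a choice made for this sketch, not something the report requires.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class StepTrace:
    intended_tool: str                    # logged before anything runs
    expected_tool: str                    # ground-truth label from the eval set
    selection_ok: bool
    execution_ok: Optional[bool] = None   # None means execution was never scored
    error: Optional[str] = None
    output: Any = None


def run_step(intended_tool: str, args: Dict[str, Any], expected_tool: str,
             tools: Dict[str, Callable[..., Any]]) -> StepTrace:
    # Stage 1: score tool selection from the name alone, before execution.
    trace = StepTrace(intended_tool, expected_tool,
                      selection_ok=(intended_tool == expected_tool))
    if not trace.selection_ok:
        return trace  # wrong tool: arguments and API are not to blame

    # Stage 2: only now execute, and score the call separately.
    try:
        trace.output = tools[intended_tool](**args)
        trace.execution_ok = True
    except Exception as exc:  # bad arguments or a tool/API failure
        trace.execution_ok = False
        trace.error = f"{type(exc).__name__}: {exc}"
    return trace


def get_weather(city: str) -> str:
    # Hypothetical tool used only for illustration.
    return f"sunny in {city}"


TOOLS = {"get_weather": get_weather}

# Right tool, wrong argument name: a pure execution failure.
right_tool_bad_args = run_step("get_weather", {"location": "Oslo"}, "get_weather", TOOLS)
# Wrong tool: a pure selection failure, never executed.
wrong_tool = run_step("search_web", {"query": "Oslo weather"}, "get_weather", TOOLS)
```

Leaving `execution_ok` as `None` for selection failures keeps the two signals from contaminating each other: a wrong-tool step never counts against execution accuracy.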
Journey Context:
When an agent fails a step, developers often waste time tweaking the tool descriptions or the prompt, assuming the agent did not know which tool to use. Often, though, the agent chose the right tool and formatted the JSON arguments incorrectly, or the API returned an unexpected error. Splitting the eval into 'Did it pick the right tool?' and 'Did it call the tool correctly?' narrows the debugging search to one stage. If tool-selection accuracy is 95% but execution accuracy is 50%, the problem lies in argument formatting or tool error handling, not in the system prompt.
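The two numbers can be computed separately from logged traces. This is a sketch under assumptions: each trace is a plain dict with hypothetical keys `selection_ok` and `execution_ok`, and execution accuracy is measured only over steps where the right tool was chosen and the call was actually scored. The sample data is made up to reproduce the 95% / 50% split described above.

```python
from typing import Dict, List, Optional

Trace = Dict[str, Optional[bool]]


def summarize(traces: List[Trace]) -> Dict[str, float]:
    # Selection accuracy: share of all steps where the right tool was picked.
    selected = [t for t in traces if t["selection_ok"]]
    # Execution accuracy: share of correctly selected, scored steps that ran cleanly.
    executed = [t for t in selected if t["execution_ok"] is not None]
    succeeded = [t for t in executed if t["execution_ok"]]
    return {
        "selection_accuracy": len(selected) / len(traces) if traces else 0.0,
        "execution_accuracy": len(succeeded) / len(executed) if executed else 0.0,
    }


# Illustrative data only: 40 steps, 38 with the right tool, 19 of those executed cleanly.
traces: List[Trace] = (
    [{"selection_ok": True, "execution_ok": True}] * 19
    + [{"selection_ok": True, "execution_ok": False}] * 19
    + [{"selection_ok": False, "execution_ok": None}] * 2
)

report = summarize(traces)
```

Conditioning execution accuracy on correct selection is the design choice that matters here: a single blended pass rate would fold both failure modes into one number and hide which stage is failing.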
Lifecycle
2026-06-21T19:58:08.373419+00:00— report_created — created