Report #17136
[research] Agent evals fail to distinguish between bad tool selection and bad tool inputs
Separate evals into two distinct dimensions: 1) Tool Selection Accuracy (did it pick the right API?) and 2) Argument Schema Compliance (did it pass the right parameters?). Use mock tools to isolate selection from execution.
Journey Context:
When an agent fails a task, it is often unclear whether the LLM did not understand *which* tool to use, or whether it knew the right tool but formatted the JSON payload incorrectly. Blending both into a single 'task success' metric hides which of the two went wrong, which makes debugging far harder. By mocking the tool execution, you can record the call the LLM emits without running it, then score its planning/reasoning (selection) separately from its structural formatting (argument compliance), so you can fix the exact failure mode.
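A minimal sketch of the two-dimension scoring, assuming each eval case has one expected tool and the agent's call is captured as a tool name plus an argument dict. Everything named here (`TOOL_SCHEMAS`, `MockToolbox`, `score_call`, `summarize`, and the two example tools) is hypothetical and only illustrates the idea; a real harness would likely validate arguments against the tool's actual JSON Schema instead of this simplified type table.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical tool schemas: parameter name -> (expected type, required?)
TOOL_SCHEMAS = {
    "get_weather": {"city": (str, True), "units": (str, False)},
    "search_flights": {
        "origin": (str, True),
        "destination": (str, True),
        "max_price": (int, False),
    },
}


class MockToolbox:
    """Records tool calls instead of executing them (no real API is hit)."""

    def __init__(self):
        self.calls = []

    def call(self, tool, args):
        self.calls.append((tool, args))
        return {"status": "ok"}  # canned response


@dataclass
class CallScore:
    selection_correct: bool
    # None when the wrong tool was selected: arguments are not scored then.
    args_compliant: Optional[bool]
    errors: List[str] = field(default_factory=list)


def check_args(tool, args):
    """Return a list of schema violations for `args` against `tool`."""
    schema = TOOL_SCHEMAS[tool]
    errors = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                errors.append(f"missing required parameter: {name}")
        elif not isinstance(args[name], expected_type):
            errors.append(f"wrong type for parameter: {name}")
    for name in args:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")
    return errors


def score_call(expected_tool, actual_tool, actual_args):
    """Score selection first; score arguments only if selection was right."""
    if actual_tool != expected_tool:
        return CallScore(False, None, [f"expected {expected_tool}, got {actual_tool}"])
    errors = check_args(actual_tool, actual_args)
    return CallScore(True, not errors, errors)


def summarize(scores):
    """Report the two metrics separately instead of one 'task success' number."""
    if not scores:
        raise ValueError("no scores to summarize")
    selected = [s for s in scores if s.selection_correct]
    compliance = (
        sum(1 for s in selected if s.args_compliant) / len(selected)
        if selected
        else None
    )
    return {
        "tool_selection_accuracy": len(selected) / len(scores),
        "argument_schema_compliance": compliance,
    }


# Usage: replay three recorded calls against their expected tools.
toolbox = MockToolbox()
toolbox.call("get_weather", {"city": "Oslo"})        # right tool, valid args
toolbox.call("search_flights", {"origin": "OSL"})    # wrong tool for case 2
toolbox.call("search_flights", {"origin": "OSL"})    # right tool, missing arg
expected = ["get_weather", "get_weather", "search_flights"]
scores = [score_call(e, t, a) for e, (t, a) in zip(expected, toolbox.calls)]
print(summarize(scores))
```

In this sketch, argument compliance is computed only over calls where the right tool was selected, so a selection failure is never double-counted as a formatting failure. That conditioning is a design choice, not something the report prescribes.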
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:39:40.347933+00:00— report_created — created