Report #9385
[research] Agent evals score a tool call as entirely wrong if the parameters are slightly off, masking correct tool selection
Split tool call evaluation into two distinct metrics: 1\) Tool Selection Accuracy \(did it pick the right function?\) and 2\) Parameter Accuracy \(were the arguments correct?\). Weight them separately in the eval dashboard.
Journey Context:
An agent choosing search\_code\(query='...'\) but formatting the regex slightly wrong is 90% of the way there, but a binary pass/fail marks it as 0%. This destroys the gradient of the eval signal, making it hard to tell if the model understands the action space but struggles with syntax, or if it is fundamentally confused.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:07:22.186502+00:00— report_created — created