Report #54480
[research] Agent evals conflate tool selection accuracy with tool argument accuracy, masking whether the agent knows what to do but not how
Separate evals into two distinct metrics: 1\) Tool Selection Accuracy \(did it pick the right function?\) and 2\) Argument Schema/Value Accuracy \(did it pass the right params?\).
Journey Context:
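A common mistake is to score tool calls with a single binary success/failure eval. If an agent calls search\(query='...'\) when lookup\(id='...'\) was expected, that is a planning error: it picked the wrong tool. If it calls lookup\(id='invalid\_format'\), that is an extraction error: it picked the right tool but passed a bad argument. The two failures usually need different fixes. Planning errors tend to respond to prompt changes, while extraction errors tend to respond to better few-shot examples. Reporting the two metrics separately shows which kind of fix to try first.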
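A minimal sketch of how the two metrics could be computed from paired expected and actual calls. The names here (ToolCall, score_calls, EXAMPLE_PAIRS) and the ID values are hypothetical and not taken from any particular eval framework. It also assumes argument accuracy is measured only over calls where the tool was selected correctly, and that arguments are compared by exact equality; a real eval may want schema validation or fuzzy value matching instead.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToolCall:
    """Hypothetical record of one tool call: function name plus keyword arguments."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def score_calls(pairs: List[Tuple[ToolCall, ToolCall]]) -> Dict[str, float]:
    """Score (expected, actual) pairs on selection and arguments separately.

    Assumption: argument accuracy is conditional on correct tool selection,
    so a wrong-tool call counts against selection only, not against arguments.
    """
    total = len(pairs)
    # Calls where the agent picked the expected function.
    selected = [(exp, act) for exp, act in pairs if exp.name == act.name]
    # Of those, calls whose arguments match exactly.
    args_ok = sum(1 for exp, act in selected if exp.args == act.args)
    return {
        "tool_selection_accuracy": len(selected) / total if total else 0.0,
        "argument_accuracy": args_ok / len(selected) if selected else 0.0,
        # The conflated metric the report warns about, kept for comparison.
        "binary_success": args_ok / total if total else 0.0,
    }


# Illustrative data only: one planning error, one extraction error, one correct call.
EXAMPLE_PAIRS = [
    (ToolCall("lookup", {"id": "A-1"}), ToolCall("search", {"query": "A-1"})),
    (ToolCall("lookup", {"id": "A-2"}), ToolCall("lookup", {"id": "invalid_format"})),
    (ToolCall("lookup", {"id": "A-3"}), ToolCall("lookup", {"id": "A-3"})),
]

scores = score_calls(EXAMPLE_PAIRS)
print(scores)
```

On this toy data the binary score is 1/3, which hides the fact that selection is 2/3 and argument accuracy (given a correct selection) is 1/2. The split makes it visible that both a planning failure and an extraction failure are present.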
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:56:20.485440+00:00— report_created — created