Report #11915
[research] Agent produces correct final output but uses wrong or excessive tool calls — inefficiency and fragility go uncaught
Evaluate tool call correctness as a dimension separate from output correctness. Track four metrics per run: (1) tool selection accuracy: did the agent call the right tool for each subtask? (2) parameter correctness: did it pass valid arguments? (3) tool call efficiency: how many calls did it make to reach the answer, compared with the optimal number? (4) unnecessary call rate: what fraction of calls did not contribute to the final answer? Log these per run and track the trend over time.
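One way the four metrics could be computed from a logged run is sketched below. This is a minimal illustration, not a reference implementation: the `ToolCall` record, the `tool_call_metrics` function, and the idea of a hand-labeled set of expected tools and a known optimal call count are all assumptions made for the example. Selection accuracy in particular is simplified to "the tool appears in the reference solution's tool set".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One logged tool call (hypothetical record shape)."""
    tool: str          # name of the tool the agent invoked
    args_valid: bool   # whether the arguments passed validation
    contributed: bool  # whether the result fed into the final answer


def tool_call_metrics(calls, expected_tools, optimal_calls):
    """Compute the four per-run metrics.

    expected_tools: set of tool names a reference solution uses (assumed labeled).
    optimal_calls: number of calls in the reference solution (assumed known).
    """
    total = len(calls)
    if total == 0:
        # No calls were made, so the rates are undefined for this run.
        return {
            "total_calls": 0,
            "selection_accuracy": None,
            "parameter_correctness": None,
            "efficiency": None,
            "unnecessary_call_rate": None,
        }
    right_tool = sum(1 for c in calls if c.tool in expected_tools)
    valid_args = sum(1 for c in calls if c.args_valid)
    wasted = sum(1 for c in calls if not c.contributed)
    return {
        "total_calls": total,
        "selection_accuracy": right_tool / total,
        "parameter_correctness": valid_args / total,
        # 1.0 means the run used no more calls than the reference solution.
        "efficiency": min(1.0, optimal_calls / total),
        "unnecessary_call_rate": wasted / total,
    }


# Example run: four calls where a reference solution needs two.
run = [
    ToolCall("read_file", args_valid=True, contributed=True),
    ToolCall("shell", args_valid=True, contributed=False),
    ToolCall("read_file", args_valid=False, contributed=False),
    ToolCall("search", args_valid=True, contributed=True),
]
metrics = tool_call_metrics(run, expected_tools={"read_file", "search"}, optimal_calls=2)
print(metrics)
```

Keeping the four values as separate fields, rather than folding them into one score, makes it easier to see which dimension moved when a trend changes.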
Journey Context:
Most agent evals check only the final output: did the agent produce the right answer? But an agent that makes 20 tool calls to reach an answer that should take 3 is slow, expensive, and fragile, because each unnecessary call is another opportunity for failure. Worse, an agent that gets the right answer via the wrong tool (e.g., reading a file by shelling out to cat instead of using the file-read tool) is liable to break when the environment changes. Evaluating tool calls separately gives you signal on efficiency and robustness that output-only evals miss. This matters for cost management: a 5x increase in tool calls per task is roughly a 5x increase in tool call cost, even if the success rate is unchanged. LangSmith's evaluation framework supports per-tool-call evaluation as a first-class concept.
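The cost point above suggests comparing call counts against a baseline rather than looking only at success rate. The sketch below is one possible check, under stated assumptions: the function name `call_count_regression` and the 1.5x alert threshold are hypothetical choices for illustration, and the inputs are simply per-task tool call counts taken from two sets of logged runs.

```python
def call_count_regression(baseline_calls, current_calls, max_ratio=1.5):
    """Compare mean tool calls per task between two sets of runs.

    Returns (ratio, flagged). max_ratio is an assumed alert threshold.
    """
    if not baseline_calls or not current_calls:
        raise ValueError("both run sets must be non-empty")
    baseline_mean = sum(baseline_calls) / len(baseline_calls)
    current_mean = sum(current_calls) / len(current_calls)
    if baseline_mean == 0:
        raise ValueError("baseline mean is zero, so the ratio is undefined")
    ratio = current_mean / baseline_mean
    return ratio, ratio > max_ratio


# Same tasks, same success rate, but the agent now makes far more calls.
ratio, flagged = call_count_regression([3, 3, 4, 2], [15, 14, 16, 15])
print(ratio, flagged)
```

An output-only eval would report no change between these two sets of runs, while this check flags the increase in calls per task.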
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:41:15.542899+00:00— report_created — created