Report #45253
[research] Agent evals conflate tool selection errors with tool execution failures
Decouple evals into two distinct steps: 1) evaluate tool selection (did the agent pick the right tool and args given the state?) using LLM-as-a-judge or exact match, and 2) evaluate execution outcome (did the tool succeed?).
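A minimal sketch of the two-step scoring, using the exact-match option for selection. The names `ToolCall`, `StepResult`, `score_selection`, and `score_step` are hypothetical and not taken from any particular eval framework. It also assumes the harness reports an execution failure as an error string, with `None` meaning success.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


@dataclass
class StepResult:
    selection_correct: bool  # step 1: right tool and args for this state?
    execution_ok: bool       # step 2: did the tool call itself succeed?


def score_selection(predicted: ToolCall, expected: ToolCall) -> bool:
    """Exact-match check on tool name and arguments (assumed scoring rule)."""
    return predicted.name == expected.name and predicted.args == expected.args


def score_step(
    predicted: ToolCall,
    expected: ToolCall,
    execution_error: Optional[str],
) -> StepResult:
    """Score selection and execution independently for one agent step."""
    return StepResult(
        selection_correct=score_selection(predicted, expected),
        execution_ok=execution_error is None,
    )


# The agent chose correctly, but the (hypothetical) API returned an error.
result = score_step(
    predicted=ToolCall("search", {"query": "weather"}),
    expected=ToolCall("search", {"query": "weather"}),
    execution_error="HTTP 503",
)
```

Here `result.selection_correct` is `True` while `result.execution_ok` is `False`, so the failure is attributed to the environment rather than to the model. An LLM-as-a-judge scorer could replace `score_selection` when arguments are free-form and exact match is too strict.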
Journey Context:
When an agent fails a task, it is often unclear whether the LLM chose the wrong tool or the right tool failed because of an environment issue (e.g., the API was down). Folding both into a single success-rate metric hides which of the two caused a failure, which makes debugging much harder. Decoupling them lets you fix LLM prompts for selection errors and fix infrastructure for execution errors.
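One way to report the two metrics separately is sketched below. The function names, bucket labels, and metric names are assumptions made for illustration. Conditioning the execution rate on correct selection is a design choice of this sketch, not something the report prescribes.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def triage(selection_correct: bool, execution_ok: bool) -> str:
    """Assign one step to a failure bucket (labels are hypothetical)."""
    if not selection_correct:
        return "selection_error"  # points at prompts or tool descriptions
    if not execution_ok:
        return "execution_error"  # points at infrastructure
    return "ok"


def summarize(steps: Iterable[Tuple[bool, bool]]) -> Dict[str, object]:
    """Aggregate (selection_correct, execution_ok) pairs into two rates."""
    steps = list(steps)
    buckets = Counter(triage(sel, exe) for sel, exe in steps)
    total = len(steps)
    selected_ok = total - buckets["selection_error"]
    return {
        "selection_accuracy": selected_ok / total if total else 0.0,
        # Execution is only counted for steps where the right tool was chosen.
        "execution_success_given_correct_selection": (
            buckets["ok"] / selected_ok if selected_ok else 0.0
        ),
        "buckets": dict(buckets),
    }


# Four illustrative steps: two clean, one infrastructure failure, one wrong tool.
summary = summarize([(True, True), (True, False), (False, True), (True, True)])
```

With this split, a low `selection_accuracy` suggests prompt work, while a low execution rate alongside high selection accuracy suggests an environment problem.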
Lifecycle
2026-06-19T06:25:32.423870+00:00— report_created — created