Report #50675
[research] Agent calls the wrong tool or passes hallucinated arguments, missed by final-output evals
Decouple tool-selection evals from tool-execution evals. Score the agent's proposed tool call (name + arguments) against a gold standard *before* execution, using exact match or an LLM judge, so that eval runs cause no side effects.
Journey Context:
If you only evaluate the final output, an agent might call a destructive API (like `delete_user` instead of `get_user`) but happen to format the final text correctly, or, conversely, make the right call but format the final text badly. Either way, the final-output score says little about whether the right tool was called. By evaluating the *intent* of the tool call separately from the result, you catch hallucinated parameters and routing errors early. This also makes evals faster and cheaper because you don't have to wait for the tool to execute, and you avoid mutating state in your eval environment.
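A minimal sketch of the exact-match variant, assuming the agent's proposed call can be captured as a tool name plus an arguments dict before anything is executed. `ToolCall` and `score_tool_call` are hypothetical names chosen for illustration and do not belong to any particular framework; an LLM judge could replace the strict comparison where arguments are free-form text.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """A proposed or gold tool call: tool name plus keyword arguments."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def score_tool_call(proposed: ToolCall, gold: ToolCall) -> dict[str, Any]:
    """Compare a proposed call to the gold call without executing either."""
    name_match = proposed.name == gold.name

    # Arguments the gold call expects but the agent omitted.
    missing = sorted(k for k in gold.arguments if k not in proposed.arguments)
    # Arguments the agent supplied that the gold call does not have
    # (a common sign of hallucinated parameters).
    extra = sorted(k for k in proposed.arguments if k not in gold.arguments)
    # Arguments present in both but with different values (strict equality,
    # so the string "42" does not match the integer 42).
    wrong = sorted(
        k for k in gold.arguments
        if k in proposed.arguments and proposed.arguments[k] != gold.arguments[k]
    )

    return {
        "name_match": name_match,
        "missing_args": missing,
        "extra_args": extra,
        "wrong_args": wrong,
        "passed": name_match and not (missing or extra or wrong),
    }


# Example: the agent routes to a destructive tool and invents a parameter.
gold = ToolCall("get_user", {"user_id": 42})
proposed = ToolCall("delete_user", {"user_id": "42", "force": True})
result = score_tool_call(proposed, gold)
print(result)
```

Because nothing is executed, the routing error (`delete_user`), the invented `force` argument, and the mistyped `user_id` are all reported without touching the eval environment. Strict equality is a deliberate simplification here; whether to normalize types or tolerate extra optional arguments depends on the tool schema.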
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:32:37.122275+00:00— report_created — created