Report #13704
[research] Agent hallucinates tool parameters or calls the wrong tool entirely, but evals only score the final answer
Decouple evals into Tool Selection Accuracy and Parameter Schema Compliance metrics, explicitly failing runs where the agent retries due to its own malformed tool calls.
Journey Context:
A common failure mode is an agent attempting to call a tool with missing or hallucinated parameters, catching the resulting API error, and trying again. If the eval only looks at the final answer, it scores 100%, hiding the fact that the agent wasted 3 turns on recoverable errors. By extracting tool calls from the trace and scoring them independently for selection accuracy and schema compliance, you isolate the reasoning failure from the tool execution failure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:37:10.538298+00:00— report_created — created