Report #80300
[research] Agent selects the correct tool but hallucinates invalid or suboptimal arguments, leading to downstream failures
Evaluate tool calls independently of final outcomes. Use trajectory evaluation to score whether the arguments passed to tools strictly adhere to the tool's schema and contain the correct contextual data, penalizing sycophantic or lazy argument generation.
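A minimal sketch of the schema-adherence half of this check, using only the Python standard library. The `ToolCall` record, the `SEARCH_CODE_SCHEMA` layout (argument name mapped to expected type and a required flag), and the `max_results` argument are assumptions made for illustration; they are not taken from any particular agent framework. A real setup would more likely validate against the tool's published JSON Schema.

```python
from dataclasses import dataclass

# Hypothetical schema layout: argument name -> (expected type, required?)
SEARCH_CODE_SCHEMA = {
    "query": (str, True),
    "max_results": (int, False),
}


@dataclass
class ToolCall:
    """One tool invocation recorded in an agent trajectory (illustrative shape)."""
    name: str
    arguments: dict


def schema_violations(call, schema):
    """Return a list of human-readable schema violations for one tool call."""
    problems = []
    for arg, (expected_type, required) in schema.items():
        if arg not in call.arguments:
            if required:
                problems.append(f"missing required argument: {arg}")
            continue
        value = call.arguments[arg]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        is_bool_as_int = expected_type is int and isinstance(value, bool)
        if not isinstance(value, expected_type) or is_bool_as_int:
            problems.append(f"wrong type for {arg}: expected {expected_type.__name__}")
    for arg in call.arguments:
        if arg not in schema:
            problems.append(f"unknown argument: {arg}")
    return problems


def score_trajectory(calls, schemas):
    """Fraction of tool calls whose arguments pass the schema check.

    The score never looks at the task's final outcome, only at the calls.
    """
    if not calls:
        return 1.0
    valid = sum(
        1 for c in calls
        if c.name in schemas and not schema_violations(c, schemas[c.name])
    )
    return valid / len(calls)
```

Scoring per call rather than per task means a run that succeeds by luck still loses points for a malformed call, and a failed run can still show which calls were well formed.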
Journey Context:
Agents often pick the right tool (e.g., search_code) but pass bad arguments (e.g., a vague query instead of the specific error string). Final-outcome evals miss this because the agent might get lucky, or fail for the wrong reason. By evaluating the tool call arguments specifically, you catch the exact point of failure in the agent's reasoning.
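The vague-query case can be sketched as a contextual check on the argument itself. The example below assumes the evaluator already knows the error string the agent saw earlier in the trajectory; the function names and the token-overlap fallback are illustrative choices, not an established metric.

```python
def _normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def contains_error_string(query, error_string):
    """True if the query carries the full error text after normalization."""
    return _normalize(error_string) in _normalize(query)


def score_query_argument(query, error_string):
    """Score a search query argument against the error the agent observed.

    Returns 1.0 when the full error text is present; otherwise the fraction
    of whitespace-separated error tokens that appear in the query.
    """
    if not error_string.strip():
        raise ValueError("error_string must be non-empty")
    if contains_error_string(query, error_string):
        return 1.0
    error_tokens = set(error_string.lower().split())
    query_tokens = set(query.lower().split())
    return len(error_tokens & query_tokens) / len(error_tokens)


# Usage: a specific query scores high, a vague one scores zero.
observed_error = "KeyError: 'user_id'"
specific = score_query_argument("KeyError: 'user_id' in session handler", observed_error)
vague = score_query_argument("fix the bug", observed_error)
```

Because the score depends only on the argument and the observed context, it flags the lazy query even when the search happens to return something useful.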
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:22:59.629082+00:00— report_created — created