Report #9765
[research] Agent passes wrong arguments to tool calls despite correct final answer
Evaluate the exact JSON payload of each tool call (its arguments) against a golden set, using JSON path assertions or partial matching, rather than evaluating only the final text response.
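A minimal sketch of the two assertion styles named above, in plain Python. The helper names (`is_subset`, `get_path`), the dotted-path syntax, and the `get_orders` payload shape are assumptions made for illustration; they are not taken from any particular eval framework.

```python
from typing import Any


def is_subset(expected: Any, actual: Any) -> bool:
    """Partial match: every key/value in `expected` must appear in `actual`.

    Extra keys in `actual` are ignored, so the golden set only needs to pin
    the arguments that matter. Lists must match element by element.
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return (
            isinstance(actual, list)
            and len(expected) == len(actual)
            and all(is_subset(e, a) for e, a in zip(expected, actual))
        )
    return expected == actual


def get_path(payload: dict, path: str) -> Any:
    """Resolve a simple dotted path such as 'arguments.user_id'.

    This is a hypothetical stand-in for a real JSON path library.
    """
    current: Any = payload
    for part in path.split("."):
        current = current[part]
    return current


# Hypothetical golden entry and two recorded tool calls.
golden = {"name": "get_orders", "arguments": {"user_id": 456}}
good_call = {"name": "get_orders", "arguments": {"user_id": 456, "limit": 10}}
bad_call = {"name": "get_orders", "arguments": {"user_id": 123, "limit": 10}}

print(is_subset(golden, good_call))             # partial match on the pinned argument
print(is_subset(golden, bad_call))              # wrong user_id is caught
print(get_path(bad_call, "arguments.user_id"))  # path-style lookup for a targeted assertion
```

Partial matching keeps the golden set small and tolerant of optional arguments, while a path assertion is useful when only one field is worth pinning.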
Journey Context:
Agents often recover from bad tool calls by apologizing or trying again, which masks the fact that they passed the wrong parameters initially. If you evaluate only the final conversational output, you miss that the agent queried `user_id=123` instead of `user_id=456` and just got lucky later. Extracting the tool call spans from your trace and asserting on them is critical for catching these silent logic errors.
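The scenario above can be sketched as follows. The trace schema here (a list of span dicts with `type`, `name`, `arguments`, and `content` keys) and the `get_user` tool are assumptions for illustration; real tracing systems use their own span formats, so the extraction step would need adapting.

```python
import json
from typing import Any, Optional


def extract_tool_calls(trace: list[dict]) -> list[tuple[str, Any]]:
    """Collect (tool name, parsed arguments) for every tool call span, in order."""
    calls = []
    for span in trace:
        if span.get("type") != "tool_call":
            continue
        args = span["arguments"]
        if isinstance(args, str):  # arguments are often stored as a JSON string
            args = json.loads(args)
        calls.append((span["name"], args))
    return calls


def first_mismatch(actual: list, golden: list) -> Optional[int]:
    """Return the index of the first call that differs from the golden
    sequence, or None if the sequences are identical."""
    for i, expected in enumerate(golden):
        if i >= len(actual) or actual[i] != expected:
            return i
    if len(actual) > len(golden):
        return len(golden)  # the agent made extra calls
    return None


# Hypothetical trace: a wrong first call, an apology, then a lucky retry.
trace = [
    {"type": "tool_call", "name": "get_user", "arguments": '{"user_id": 123}'},
    {"type": "message", "content": "Sorry, let me try again."},
    {"type": "tool_call", "name": "get_user", "arguments": '{"user_id": 456}'},
    {"type": "message", "content": "The user's plan is Pro."},
]
golden_calls = [("get_user", {"user_id": 456})]

final_answer_ok = trace[-1]["content"] == "The user's plan is Pro."
mismatch_index = first_mismatch(extract_tool_calls(trace), golden_calls)

print(final_answer_ok)   # the final-text check passes
print(mismatch_index)    # the tool call check flags the first call
```

Comparing the full ordered sequence is a deliberately strict choice: it flags the retry as well as the wrong argument. A looser check could instead apply partial matching to each call.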
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:06:30.443524+00:00— report_created — created