Report #9173
[research] Agent selects the correct tool but for the wrong reason, passing bad parameters
Evaluate the tool-call generation step (the JSON arguments) independently of the tool execution result. Use a golden dataset of expected parameter mappings.
Journey Context:
It is common for an agent to call the right tool (e.g., get_user(id=1)) with hallucinated parameters, or to call it for the wrong reason and still succeed by luck. If you evaluate only the final outcome, you miss this fragility. Evaluating intent and parameters at the span level catches the problem before it causes a silent data-corruption bug in production.
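A minimal sketch of the span-level check described above, in Python. Everything named here is hypothetical and for illustration only: the ToolCall and CallVerdict types, the evaluate_call helper, the GOLDEN mapping, and the example prompt are assumptions, not part of any particular agent framework. The sketch compares the generated tool name and JSON arguments against a golden entry and never looks at the tool's execution result.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One tool-call span: the tool name plus its JSON arguments (hypothetical shape)."""
    name: str
    arguments: dict[str, Any]


@dataclass
class CallVerdict:
    """Result of comparing a generated call with its golden counterpart."""
    tool_ok: bool
    missing: list[str] = field(default_factory=list)
    unexpected: list[str] = field(default_factory=list)
    mismatched: dict[str, tuple[Any, Any]] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # Pass only if the tool AND every argument match the golden entry.
        return self.tool_ok and not (self.missing or self.unexpected or self.mismatched)


def evaluate_call(actual: ToolCall, expected: ToolCall) -> CallVerdict:
    """Compare the generated call with the golden call, ignoring execution results."""
    verdict = CallVerdict(tool_ok=(actual.name == expected.name))
    for key, want in expected.arguments.items():
        if key not in actual.arguments:
            verdict.missing.append(key)
        elif actual.arguments[key] != want:
            # Record (expected, actual) so the report shows both values.
            verdict.mismatched[key] = (want, actual.arguments[key])
    verdict.unexpected = sorted(set(actual.arguments) - set(expected.arguments))
    return verdict


# Hypothetical golden dataset: prompt -> expected tool call.
GOLDEN = {
    "Look up the account for user 42": ToolCall("get_user", {"id": 42}),
}

# Span captured from an agent trace: right tool, hallucinated id.
observed = ToolCall("get_user", {"id": 1})
verdict = evaluate_call(observed, GOLDEN["Look up the account for user 42"])

print(verdict.tool_ok, verdict.passed, verdict.mismatched)
```

In this example the tool name matches, so an outcome-only check could pass if get_user(id=1) happens to return a valid record, while the parameter-level verdict fails on the id mismatch. Exact equality is the simplest comparison; free-text arguments would likely need a looser matcher.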
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:34:50.765527+00:00— report_created — created