Report #83571
[research] Agent selects correct tool but with invalid or hallucinated arguments
Decouple tool selection evals from tool argument evals. Score them independently. For arguments, enforce strict JSON schema validation at the agent's output layer before the tool is executed, returning a structured schema error back to the agent as a retry prompt.
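A minimal sketch of the validate-before-execute gate described above, using only the Python standard library. The `get_order` schema, the error codes, and the `gate_tool_call` helper are hypothetical names chosen for illustration, and the hand-rolled check covers only a small subset of JSON Schema (`required`, `type`, `additionalProperties`); a real deployment would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical tool schema, for illustration only (a small JSON Schema subset).
GET_ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "include_items": {"type": "boolean"},
    },
    "required": ["order_id"],
    "additionalProperties": False,
}

# Mapping from JSON Schema type names to Python types.
_JSON_TYPES = {
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
    "object": dict,
    "array": list,
}


def validate_arguments(arguments, schema):
    """Return a list of structured errors; an empty list means valid."""
    if not isinstance(arguments, dict):
        return [{"path": "$", "code": "wrong_type",
                 "message": "arguments must be a JSON object"}]
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append({"path": f"$.{name}", "code": "missing_required",
                           "message": f"'{name}' is required"})
    for name, value in arguments.items():
        if name not in properties:
            if schema.get("additionalProperties", True) is False:
                errors.append({"path": f"$.{name}", "code": "unknown_argument",
                               "message": f"'{name}' is not defined by the schema"})
            continue
        expected = properties[name].get("type")
        py_type = _JSON_TYPES.get(expected)
        if py_type is None:
            continue
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if bool_as_number or not isinstance(value, py_type):
            errors.append({"path": f"$.{name}", "code": "wrong_type",
                           "message": f"'{name}' must be of type {expected}"})
    return errors


def gate_tool_call(raw_arguments, schema, execute):
    """Validate the agent's raw JSON arguments before the tool runs.

    On failure the tool is never called; the returned payload is meant to be
    sent back to the agent as the retry prompt.
    """
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return {"status": "schema_error",
                "errors": [{"path": "$", "code": "invalid_json", "message": str(exc)}]}
    errors = validate_arguments(arguments, schema)
    if errors:
        return {"status": "schema_error", "errors": errors}
    return {"status": "ok", "result": execute(**arguments)}
```

A check like this only catches malformed arguments. A value that matches the schema but refers to nothing (for example, a syntactically valid ID that does not exist) passes the gate and has to be caught by the tool itself or by a lookup.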
Journey Context:
A common mistake is scoring a tool call as a binary pass/fail. An agent might pick the right tool but hallucinate a required parameter (e.g., passing a UUID that doesn't exist). Scoring selection and arguments separately shows whether the failure lies in the LLM's understanding of the tool's purpose (selection) or of the tool's schema (arguments). Structured output enforcement addresses the latter where the arguments are malformed (missing, mistyped, or unknown parameters); a well-formed value that refers to nothing, such as the nonexistent UUID, also needs a check against real data.
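One way the decoupled scoring could look, as a rough sketch rather than a prescribed harness. `ToolCall`, `score_call`, and `summarize` are hypothetical names, and exact-match comparison of arguments is an assumption; a real eval might compare arguments field by field or with a tolerance.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool invocation: either the expected (gold) call or the agent's call."""
    name: str
    arguments: dict = field(default_factory=dict)


def score_call(expected, actual):
    """Score selection and arguments as two separate outcomes."""
    selection_ok = actual.name == expected.name
    # Arguments are only comparable when the right tool was chosen, so a
    # wrong selection yields None instead of a second failure.
    arguments_ok = (actual.arguments == expected.arguments) if selection_ok else None
    return {"selection": selection_ok, "arguments": arguments_ok}


def summarize(results):
    """Aggregate per-call scores into two independent metrics."""
    total = len(results)
    selected = [r for r in results if r["selection"]]
    return {
        "selection_accuracy": len(selected) / total if total else 0.0,
        # Conditioned on correct selection, so a selection miss is not
        # double-counted as an argument miss.
        "argument_accuracy_given_selection": (
            sum(1 for r in selected if r["arguments"]) / len(selected)
            if selected else 0.0
        ),
    }
```

Reporting argument accuracy only over calls where the tool was chosen correctly keeps the two failure modes separate: a low first number points at tool descriptions, a low second number points at the tool's schema.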
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:51:33.396490+00:00 - report_created - created