Report #8807
[research] Agent selects the correct tool but hallucinates invalid or suboptimal arguments
Evals must decouple tool selection from tool argument generation, using JSON schema validation as a deterministic eval layer before execution.
Journey Context:
Most evals only check whether the agent called search(query). If query is malformed, though, the tool call fails even when the tool choice was correct. A common response is to use LLM-as-a-judge for this, which is overkill and slow for a check that can be mechanical. The right approach is deterministic: extract the tool call arguments from the trace and validate them against the tool's JSON schema. A call that fails schema validation is an automatic eval failure.
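The deterministic layer described above could look roughly like the sketch below. It is a minimal illustration, not a reference implementation. The trace shape (a dict with "name" and "arguments"), the SEARCH_SCHEMA definition, and the helper names schema_errors and eval_tool_call are all assumptions made for this example; the report defines none of them. The validator is hand-rolled and covers only a small subset of JSON Schema (type, required, properties, additionalProperties, minLength, minimum) so the example runs on the standard library alone; a real eval would more likely use a full validator library.

```python
import json

# Hypothetical schema for a `search` tool; the report does not define one.
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "minLength": 1},
        "limit": {"type": "integer", "minimum": 1},
    },
    "required": ["query"],
    "additionalProperties": False,
}

_TYPES = {"string": str, "integer": int, "number": (int, float),
          "boolean": bool, "object": dict, "array": list}


def schema_errors(value, schema, path="$"):
    """Return a list of violations for a small subset of JSON Schema."""
    expected = schema.get("type")
    if expected:
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool_as_number = (isinstance(value, bool)
                             and expected in ("integer", "number"))
        if not isinstance(value, _TYPES[expected]) or is_bool_as_number:
            return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(value, str) and len(value) < schema.get("minLength", 0):
        errors.append(f"{path}: shorter than minLength")
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_number and "minimum" in schema and value < schema["minimum"]:
        errors.append(f"{path}: below minimum")
    if isinstance(value, dict):
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, item in value.items():
            if key in props:
                errors.extend(schema_errors(item, props[key], f"{path}.{key}"))
            elif schema.get("additionalProperties") is False:
                errors.append(f"{path}.{key}: unexpected field")
    return errors


def eval_tool_call(call, expected_tool, schemas):
    """Score tool selection and argument validity as two separate checks."""
    selection_ok = call["name"] == expected_tool
    schema = schemas.get(call["name"])
    if schema is None:
        arg_errors = [f"no schema registered for tool {call['name']!r}"]
    else:
        try:
            raw = call["arguments"]
            # Assumption: arguments arrive as a JSON string or a parsed dict.
            args = json.loads(raw) if isinstance(raw, str) else raw
            arg_errors = schema_errors(args, schema)
        except json.JSONDecodeError as exc:
            arg_errors = [f"arguments are not valid JSON: {exc.msg}"]
    return {"selection_ok": selection_ok,
            "args_ok": not arg_errors,
            "arg_errors": arg_errors}


# Correct tool, invalid arguments: empty query and a string where an
# integer is required.
result = eval_tool_call(
    {"name": "search", "arguments": '{"query": "", "limit": "10"}'},
    expected_tool="search",
    schemas={"search": SEARCH_SCHEMA},
)
```

Reporting selection_ok and args_ok as separate fields keeps the two failure modes distinguishable in aggregate metrics. Note that a check like this only catches structurally invalid arguments; well-formed but suboptimal arguments would pass this layer and need a separate eval.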
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:36:13.264044+00:00 - report_created - created