Report #43792
[research] Agent evals pass despite the agent hallucinating tool parameters because the tool execution fails gracefully
Decouple tool-call generation evals from tool-execution evals. Assert the exact JSON schema and argument validity of the LLM's output before the tool is executed.
Journey Context:
If an agent calls search(query=123) instead of search(query='123'), the API might return an empty result or a 400 error, which the agent then recovers from. The eval appears to pass because the agent eventually succeeded, even though the agent hallucinated a parameter of the wrong type. Evaluate the raw function-call output from the LLM to ensure it strictly adheres to the tool schema, independent of the tool's runtime behavior.
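A minimal sketch of such a pre-execution check, using only the Python standard library. The schema layout, the name/arguments wire format, and the names SEARCH_TOOL_SCHEMA and validate_tool_call are assumptions made for illustration; they do not come from any particular agent framework, and a real setup would more likely validate against the tool's published JSON Schema.

```python
import json

# Hypothetical schema for the `search` tool from the example above.
SEARCH_TOOL_SCHEMA = {
    "name": "search",
    "parameters": {
        "query": {"type": str, "required": True},
    },
}


def validate_tool_call(raw_call: str, schema: dict) -> list[str]:
    """Return the schema violations in a raw tool call (empty list if valid)."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    errors = []
    if call.get("name") != schema["name"]:
        errors.append(f"unexpected tool name: {call.get('name')!r}")

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return errors + ["arguments must be a JSON object"]

    params = schema["parameters"]
    for key, spec in params.items():
        if key not in args:
            if spec["required"]:
                errors.append(f"missing required argument: {key}")
        elif type(args[key]) is not spec["type"]:
            # Exact type match: no coercion, so 123 is rejected where a str is expected.
            errors.append(
                f"argument {key!r} has type {type(args[key]).__name__}, "
                f"expected {spec['type'].__name__}"
            )
    for key in args:
        if key not in params:
            errors.append(f"unknown argument: {key}")
    return errors


# The hallucinated call from the example: query is an integer, not a string.
bad_call = '{"name": "search", "arguments": {"query": 123}}'
good_call = '{"name": "search", "arguments": {"query": "123"}}'

bad_errors = validate_tool_call(bad_call, SEARCH_TOOL_SCHEMA)
good_errors = validate_tool_call(good_call, SEARCH_TOOL_SCHEMA)
```

In an eval harness, a check like this would run on the model's raw output and fail the test case whenever the returned list is non-empty, before the tool is invoked, so a graceful runtime recovery cannot mask the bad call.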
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:58:37.716316+00:00— report_created — created