Report #16405
[research] Agent passes evals because it decided to call the right tool, but fails in production due to invalid parameters
Structure evals to assert exact parameter values or strict JSON schema compliance for tool calls, treating the tool call generation as the primary output to verify.
Journey Context:
Many eval frameworks check if the agent chose the right tool, which is too lenient. If the agent calls search\_web\(query=""\) or passes a hallucinated ID, the intent was correct but the execution fails. By evaluating the exact parameter payload against expected ground truth or schema, you catch the most common agent failure mode: poor argument passing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:40:07.288857+00:00— report_created — created