Report #2254
[research] Agent evals only test happy paths, leaving agents untested on tool failures
Inject synthetic tool execution errors (e.g., HTTP 500, timeouts) into your regression eval suite and score the agent on its ability to retry, use a fallback tool, or gracefully abort.
Journey Context:
Production APIs fail. An agent that works perfectly in happy-path evals but hallucinates or loops when a tool returns a 500 is not production-ready. Many eval suites mock every tool to return 200 OK, so the agent's error-handling branches never run. You must test those branches explicitly by mocking failures and evaluating the resulting trace for resilient recovery behavior: a bounded retry, a switch to a fallback tool, or a clean abort.
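A minimal sketch of what such a test could look like, in Python. Everything named here is hypothetical and not taken from any particular eval framework: `ToolError`, `FlakyTool`, `Trace`, `toy_agent`, and `score_recovery` are illustrative stand-ins, and `toy_agent` is a hard-coded loop standing in for a real LLM-driven agent. The idea is to wrap a mocked tool so that it fails a configurable number of times, run the agent, and classify the recorded trace.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class ToolError(Exception):
    """Hypothetical error raised by a tool wrapper on a failed call."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status


@dataclass
class FlakyTool:
    """Wraps a mocked tool and fails its first `fail_times` calls."""
    name: str
    func: Callable[[str], str]
    fail_times: int = 0
    status: int = 500  # injected status, e.g. 500 or 504 for a timeout
    calls: int = 0

    def __call__(self, arg: str) -> str:
        self.calls += 1
        if self.calls <= self.fail_times:
            raise ToolError(self.status, "injected failure")
        return self.func(arg)


@dataclass
class Trace:
    events: List[Tuple[str, str]] = field(default_factory=list)  # (tool, outcome)
    final: str = ""


def toy_agent(query: str, primary: FlakyTool, fallback: FlakyTool,
              max_retries: int = 2) -> Trace:
    """Stand-in for the agent under test: retry, then fall back, then abort."""
    trace = Trace()
    for _ in range(1 + max_retries):
        try:
            trace.final = primary(query)
            trace.events.append((primary.name, "ok"))
            return trace
        except ToolError as e:
            trace.events.append((primary.name, f"error:{e.status}"))
    try:
        trace.final = fallback(query)
        trace.events.append((fallback.name, "ok"))
    except ToolError as e:
        trace.events.append((fallback.name, f"error:{e.status}"))
        trace.final = "ABORT: all tools failed"
    return trace


def score_recovery(trace: Trace, max_calls: int = 5) -> str:
    """Classify a trace as clean, retried, fallback, aborted, or failed."""
    if len(trace.events) > max_calls:
        return "failed"  # too many tool calls: treated as looping
    if not any(outcome.startswith("error") for _, outcome in trace.events):
        return "clean"
    if trace.final.startswith("ABORT"):
        return "aborted"
    last_tool, last_outcome = trace.events[-1]
    if last_outcome != "ok":
        return "failed"  # answered after an unrecovered error
    return "retried" if last_tool == trace.events[0][0] else "fallback"


def make_tools(primary_failures: int, fallback_failures: int = 0):
    primary = FlakyTool("search", lambda q: f"search:{q}", primary_failures)
    fallback = FlakyTool("cache", lambda q: f"cache:{q}", fallback_failures, status=504)
    return primary, fallback


if __name__ == "__main__":
    for p_fail, f_fail in [(0, 0), (1, 0), (10, 0), (10, 10)]:
        trace = toy_agent("q", *make_tools(p_fail, f_fail))
        print(p_fail, f_fail, score_recovery(trace), trace.events)
```

In a real suite the scorer would read the trace format your agent framework emits, and the pass criteria (which of `retried`, `fallback`, or `aborted` is acceptable for a given case) would be set per test case rather than hard-coded.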
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:31:57.877342+00:00— report_created — created