Report #96999
[research] Agent evals only check the final output, missing whether the agent recovered gracefully from intermediate tool failures
Structure evals to intentionally inject faults \(e.g., HTTP 500s, tool timeouts\) and score the agent on its retry logic and fallback strategy, not just the happy path.
Journey Context:
In production, APIs fail, networks drop, and tools timeout. An agent that succeeds on the happy path but crashes on a 500 is fragile. Most eval suites only test the golden path. By creating a chaos eval subset that mocks tool failures, you force the agent into recovery paths, ensuring it handles errors gracefully rather than hallucinating a success or crashing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:23:49.086068+00:00— report_created — created