Report #75285
[research] Agent fails completely on the first error instead of recovering, but evals only measure zero-shot success
Add 'adversarial perturbation' evals: intentionally inject errors (e.g., mock a 500 API response, throw a tool timeout) into golden trajectories and score the agent's ability to re-plan and recover, rather than just scoring happy-path execution.
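A minimal sketch of what such a perturbation eval could look like, using only the Python standard library. Everything named here is hypothetical and chosen for illustration: the `Step` and `RecoveryResult` types, the `InjectedFault` exception, the agent interface (a callable that takes a trajectory and a tool table and returns `True` on completion), and the two toy agents. A real harness would plug in the actual agent and tool layer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


class InjectedFault(Exception):
    """Hypothetical error the harness raises in place of a real tool result."""


@dataclass
class Step:
    tool: str  # tool the golden trajectory calls at this step
    arg: str   # argument passed to that tool


@dataclass
class RecoveryResult:
    completed: bool   # agent reached the end of the task
    faults_seen: int  # injected faults the agent actually hit
    recovered: bool   # hit at least one fault and still completed


Tool = Callable[[str], str]
Agent = Callable[[List[Step], Dict[str, Tool]], bool]


def make_faulty_tools(tools: Dict[str, Tool], fault_at_call: int,
                      fault: Exception) -> Tuple[Dict[str, Tool], Dict[str, int]]:
    """Wrap every tool so the N-th tool call overall (0-based) raises `fault` once."""
    state = {"calls": 0, "faults": 0}

    def wrap(fn: Tool) -> Tool:
        def wrapped(arg: str) -> str:
            idx = state["calls"]
            state["calls"] += 1
            if idx == fault_at_call:
                state["faults"] += 1
                raise fault
            return fn(arg)
        return wrapped

    return {name: wrap(fn) for name, fn in tools.items()}, state


def score_recovery(agent: Agent, trajectory: List[Step], tools: Dict[str, Tool],
                   fault_at_call: int, fault: Exception) -> RecoveryResult:
    """Run the agent against perturbed tools and record whether it recovered."""
    faulty, state = make_faulty_tools(tools, fault_at_call, fault)
    try:
        completed = agent(trajectory, faulty)
    except Exception:
        completed = False  # an unhandled error counts as a failed run
    return RecoveryResult(completed, state["faults"],
                          completed and state["faults"] > 0)


def recovery_rate(agent: Agent, trajectory: List[Step],
                  tools: Dict[str, Tool]) -> float:
    """Fraction of single-fault injection points the agent recovers from."""
    fault = InjectedFault("HTTP 500 from mocked API")
    results = [score_recovery(agent, trajectory, tools, i, fault)
               for i in range(len(trajectory))]
    return sum(r.recovered for r in results) / len(results)


def fragile_agent(trajectory: List[Step], tools: Dict[str, Tool]) -> bool:
    """Toy agent with no error handling: the first fault aborts the run."""
    for step in trajectory:
        tools[step.tool](step.arg)
    return True


def retrying_agent(trajectory: List[Step], tools: Dict[str, Tool]) -> bool:
    """Toy agent that retries a failed step once before giving up."""
    for step in trajectory:
        for _attempt in range(2):
            try:
                tools[step.tool](step.arg)
                break
            except InjectedFault:
                continue
        else:
            return False  # both attempts failed
    return True


# Toy golden trajectory and tool table, for illustration only.
GOLDEN = [Step("search", "query"), Step("fetch", "url"), Step("summarize", "text")]
TOOLS: Dict[str, Tool] = {
    "search": lambda a: f"results for {a}",
    "fetch": lambda a: f"body of {a}",
    "summarize": lambda a: f"summary of {a}",
}

if __name__ == "__main__":
    print("fragile :", recovery_rate(fragile_agent, GOLDEN, TOOLS))
    print("retrying:", recovery_rate(retrying_agent, GOLDEN, TOOLS))
```

Both toy agents pass the unperturbed trajectory, so a happy-path-only eval cannot tell them apart; only the injected faults separate them. A simple retry is just the smallest stand-in for recovery here. Scoring real re-planning would likely also need to compare the agent's post-fault steps against an acceptable alternative trajectory.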
Journey Context:
Most eval suites test only the happy path. In production, APIs fail, networks drop, and tools time out, so much of an agent's value lies in how it recovers from errors. Without evals for re-planning, you deploy fragile agents that fail catastrophically at the first sign of friction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:57:28.315976+00:00 - report_created - created