Report #64650
[synthesis] AI agent fails completely on the first error with no recovery path, forcing the user to restart the task
Design the agent loop with error recovery as a first-class architectural component: (1) every tool/action call must have a defined failure mode, (2) on failure, the error output is fed back to the model as an observation, (3) the model decides whether to retry with modification, try an alternative approach, or escalate to the user, (4) maintain a retry budget per step (typically 2-3) and per task (typically 5-8) to prevent infinite loops, (5) when all retries are exhausted, present a clear summary of what was attempted and what failed.
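The five points above can be sketched as a small loop. This is a minimal illustration, not a reference implementation: `run_task`, `run_tool`, `decide`, `Attempt`, and `TaskResult` are hypothetical names, `run_tool` is assumed to return an `(ok, output)` pair, and `decide` stands in for the model call that returns a verdict (`"retry"`, `"alternative"`, or `"escalate"`) plus the next action to try. The default budgets are picked from the ranges suggested above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    step: str
    action: str
    error: str


@dataclass
class TaskResult:
    ok: bool = True
    outputs: List[str] = field(default_factory=list)
    failures: List[Attempt] = field(default_factory=list)
    summary: str = ""


def summarize(failures: List[Attempt], reason: str) -> str:
    """Build the user-facing report of what was attempted and what failed."""
    lines = [f"Gave up: {reason}"]
    lines += [f"- step {a.step!r}: tried {a.action!r}, got: {a.error}" for a in failures]
    return "\n".join(lines)


def run_task(
    steps: List[str],
    run_tool: Callable[[str], Tuple[bool, str]],          # hypothetical tool runner
    decide: Callable[[str, str, str], Tuple[str, str]],   # hypothetical model call
    step_budget: int = 3,
    task_budget: int = 6,
) -> TaskResult:
    result = TaskResult()
    task_retries = 0
    for step in steps:
        action, step_retries = step, 0
        while True:
            ok, output = run_tool(action)
            if ok:
                result.outputs.append(output)
                break
            # Every failure is recorded so the final summary is complete.
            result.failures.append(Attempt(step, action, output))
            reason = None
            if step_retries >= step_budget:
                reason = f"step retry budget ({step_budget}) exhausted"
            elif task_retries >= task_budget:
                reason = f"task retry budget ({task_budget}) exhausted"
            else:
                # Feed the error back as an observation; the model picks the next move.
                verdict, next_action = decide(step, action, output)
                if verdict == "escalate":
                    reason = "model asked for user input"
            if reason is not None:
                result.ok = False
                result.summary = summarize(result.failures, reason)
                return result
            action = next_action
            step_retries += 1
            task_retries += 1
    return result
```

Checking the budgets before calling `decide` avoids spending a model call on a retry that would not be allowed anyway. A caller would show `result.summary` to the user whenever `result.ok` is false.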
Journey Context:
Devin's demo showed this clearly: when a command fails, it reads the error output and tries a different approach. Cursor's agent mode feeds type-checker and linter errors back to the model for self-correction. Aider sends linting errors back to the model along with the failing code. The pattern across these products: successful AI coding agents treat errors as part of the normal flow, not as exceptions. This is fundamentally different from traditional software, where errors are exceptional. In an AI agent, the model will frequently produce code that doesn't work on the first try; this is expected behavior, not a bug, and the architecture must be designed around it. The critical design decisions are: (a) what error information to feed back (full stack traces overwhelm the model; distilled error messages work better), (b) retry budgets (without them, the model enters infinite retry loops on fundamentally unsolvable problems), and (c) escalation paths (when to give up and ask the user). Products that don't architect for error recovery either fail silently (worst UX) or loop forever (second worst UX).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:00:03.128582+00:00— report_created — created