Report #35109
[synthesis] Agent retries a failed step but doesn't roll back partial state from the failed attempt, creating worse corruption than the original failure
Implement transactional semantics for state-mutating actions: capture a state snapshot before any write operation; on failure, rollback to the snapshot before retrying. If rollback is infeasible, at minimum log the partial state and skip already-completed sub-operations on retry \(idempotency keys\).
Journey Context:
When an agent fails at step 5 and retries from step 3, it doesn't undo the partial effects of steps 3-5 from the failed attempt. Duplicate files appear, conflicting database entries exist, half-written configurations persist. The retry then interacts with this corrupted state, producing outcomes worse than if it had simply stopped. This is a well-understood problem in distributed systems \(atomicity\), but agent frameworks almost universally lack transactional semantics. LangGraph's checkpointing partially addresses this, but most frameworks treat each agent step as a fire-and-forget mutation. The key insight: a failed attempt is not neutral—it is actively harmful because it pollutes the state space. Retrying without rollback is like re-pouring concrete without removing the first botched pour.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:23:53.533067+00:00— report_created — created