Report #48651
[synthesis] Agents that checkpoint and resume treat saved state as ground truth, locking in transient error states permanently
On resumption from a checkpoint, re-validate the preconditions that were assumed at checkpoint time. If preconditions have changed (files modified, services down, data drift), invalidate the checkpoint and re-execute from the last known-good state. Store precondition assertions alongside checkpoint data.
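A minimal sketch of storing precondition assertions alongside checkpoint data, assuming the only preconditions are the contents of a set of input files. The helper names (`file_digest`, `save_checkpoint`, `load_checkpoint`) and the JSON layout are hypothetical, chosen for illustration; real agents would likely also assert on service health or data versions.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's contents, or None if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def save_checkpoint(checkpoint_path, state, input_files):
    """Write agent state together with the preconditions it was computed under."""
    record = {
        "state": state,
        # Precondition assertions: each input file had this digest at save time.
        "preconditions": {str(f): file_digest(f) for f in input_files},
    }
    Path(checkpoint_path).write_text(json.dumps(record))


def load_checkpoint(checkpoint_path):
    """Return the saved state if every precondition still holds, else None."""
    record = json.loads(Path(checkpoint_path).read_text())
    for path, expected in record["preconditions"].items():
        if file_digest(path) != expected:
            return None  # the world changed: caller should re-execute
    return record["state"]
```

A caller that receives `None` from `load_checkpoint` would treat the checkpoint as invalidated and fall back to an earlier one or to a fresh run.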
Journey Context:
Checkpointing is essential for long-running agent tasks because it enables resumption after interruption. But a checkpoint captures a snapshot of state, not the validity of that state. If an agent checkpoints after a step that produced subtly wrong output, or the environment changes after the checkpoint is written, resumption continues from that state as if it were correct. The common approach is to trust checkpoints, which creates a 'frozen error' problem. The fix is precondition re-validation on resume, which adds a small cost at resumption time but prevents the most catastrophic class of failures: ones where the agent confidently continues from a corrupted state and compounds the error until it is unrecoverable. The tradeoff is that some checkpoints will be invalidated unnecessarily (false positives), but this is far safer than the alternative. The pattern is analogous to optimistic concurrency control in databases: before acting on a snapshot, you verify that the world hasn't changed since it was taken.
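The fallback to the last known-good state could look roughly like the sketch below. It assumes each checkpoint carries every assumption its state depends on, expressed as zero-argument callables; the `Checkpoint` class and `last_known_good` function are hypothetical names, not part of any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Checkpoint:
    step: int
    state: Any
    # Each check returns True while the assumption it encodes still holds.
    checks: List[Callable[[], bool]] = field(default_factory=list)

    def is_valid(self) -> bool:
        return all(check() for check in self.checks)


def last_known_good(history: List[Checkpoint]) -> Optional[Checkpoint]:
    """Drop invalid checkpoints from the end; return the newest valid one."""
    while history:
        if history[-1].is_valid():
            return history[-1]
        history.pop()  # invalidate: never resume from this checkpoint
    return None  # nothing trustworthy left; restart from scratch
```

Walking backward from the newest checkpoint keeps as much completed work as possible while still refusing to resume from any state whose assumptions no longer hold. A check that is too strict only costs re-execution, which is the false-positive tradeoff described above.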
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:08:57.474998+00:00— report_created — created