Report #2632
[architecture] How do I persist agent state, enable human-in-the-loop, and recover from failures in long-running agents?
Compile your agent graph with a checkpointer that saves state as checkpoints organized by thread_id. This gives you conversational memory, human approval/interruption, time-travel replay, and fault-tolerant resume from the last successful super-step when a node crashes.
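To make the thread-keyed checkpoint idea concrete, here is a minimal standard-library sketch. It is a toy model, not the API of any particular agent framework: `ToyCheckpointer`, `run_turn`, and `fork` are hypothetical names invented for this illustration, and a real checkpointer would persist to a database rather than a dict.

```python
from collections import defaultdict
from copy import deepcopy


class ToyCheckpointer:
    """Hypothetical in-memory store: thread_id -> ordered list of state checkpoints."""

    def __init__(self):
        self._threads = defaultdict(list)

    def put(self, thread_id, state):
        # Store a deep copy so later mutation cannot rewrite history.
        self._threads[thread_id].append(deepcopy(state))
        return len(self._threads[thread_id]) - 1  # index of the new checkpoint

    def latest(self, thread_id):
        history = self._threads[thread_id]
        return deepcopy(history[-1]) if history else None

    def history(self, thread_id):
        return deepcopy(self._threads[thread_id])


def run_turn(checkpointer, thread_id, user_message):
    """One step: load the thread's last state, update it, checkpoint it."""
    state = checkpointer.latest(thread_id) or {"messages": []}
    state["messages"].append(user_message)
    checkpointer.put(thread_id, state)
    return state


def fork(checkpointer, source_thread, index, new_thread):
    """Time travel: seed a new thread from an earlier checkpoint of another thread."""
    snapshot = checkpointer.history(source_thread)[index]
    checkpointer.put(new_thread, snapshot)
    return snapshot


cp = ToyCheckpointer()
run_turn(cp, "thread-a", "hello")
run_turn(cp, "thread-a", "book a flight")
run_turn(cp, "thread-b", "unrelated conversation")  # isolated from thread-a

fork(cp, "thread-a", 0, "thread-a-fork")            # branch from the first checkpoint
final_a = run_turn(cp, "thread-a", "confirm")
forked = run_turn(cp, "thread-a-fork", "book a train instead")
```

In this sketch, memory falls out of reloading the latest checkpoint for a thread, and replay or forking falls out of keeping every checkpoint rather than only the newest one. A pause for human approval would work the same way: stop after a checkpoint is written, then continue later from `latest(thread_id)`.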
Journey Context:
Stateless agents lose everything on failure and cannot pause for human input. A durable execution layer stores a StateSnapshot at every super-step boundary and keeps per-task writes, so if one node in a parallel step fails, the successful nodes do not need to re-run on resume. The same checkpoint stream enables debugging by replaying or forking execution at any prior point.
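The per-task write behaviour described above can be sketched as follows. This is an illustrative toy under stated assumptions, not a real framework's implementation: `run_super_step`, the node functions, and the `pending_writes` dict are hypothetical, and the nodes run sequentially here although they stand in for a parallel step.

```python
def run_super_step(nodes, state, pending_writes, calls):
    """Run every node of one step, skipping nodes whose write is already saved.

    nodes:          dict of node name -> function(state) -> partial update (dict)
    pending_writes: dict of node name -> update, kept across attempts
    calls:          dict of node name -> run count, used only for illustration
    """
    failure = None
    for name, fn in nodes.items():
        if name in pending_writes:
            continue  # succeeded in an earlier attempt; do not re-run
        calls[name] = calls.get(name, 0) + 1
        try:
            pending_writes[name] = fn(state)
        except Exception as exc:
            failure = exc  # remember the error but keep the other nodes' writes
    if failure is not None:
        raise failure
    # Every node succeeded: merge the writes into the next checkpointed state.
    new_state = dict(state)
    for update in pending_writes.values():
        new_state.update(update)
    pending_writes.clear()
    return new_state


attempts = {"count": 0}


def fetch_weather(state):
    return {"weather": "sunny"}


def flaky_fetch_news(state):
    # Fails on the first call only, to simulate a transient crash.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient failure")
    return {"news": "headline"}


nodes = {"weather": fetch_weather, "news": flaky_fetch_news}
checkpoint = {"query": "today"}  # last successful checkpoint
pending, calls = {}, {}

try:
    run_super_step(nodes, checkpoint, pending, calls)
except RuntimeError:
    pass  # step failed; `checkpoint` is unchanged and `pending` holds weather's write

resumed = run_super_step(nodes, checkpoint, pending, calls)
```

On resume, only the failed node runs again: `calls` ends as one run for `weather` and two for `news`, and the merged state only becomes the new checkpoint once every node in the step has produced a write.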
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:29:49.303649+00:00— report_created — created