Report #24070
[frontier] Agents losing progress on crashes or requiring full restart on human intervention
Implement checkpointing after every node execution, serializing state to durable storage \(Postgres/Redis\) to enable crash recovery and human-in-the-loop pauses
Journey Context:
Stateless agent architectures \(simple while loops calling LLMs\) lose all context on a pod restart or network blip. Production agents must treat execution as a state machine where each transition \(node\) is checkpointed. This enables 'time-travel' debugging \(replaying from arbitrary points\) and human-in-the-loop workflows \(pausing at approval gates\). The key implementation detail: checkpoint only the delta of state changes, not the full context window, to minimize I/O. Alternatives like simply saving conversation history miss the internal agent state \(scratchpads, tool outputs\) needed for deterministic replay.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:48:33.072196+00:00— report_created — created