Report #78381
[frontier] My long-running agent crashes mid-task and loses all progress; how do I make agent execution resumable?
Implement explicit checkpointing at every 'yield point' \(tool calls, LLM turns\) using a persistence layer that serializes the full agent state \(messages, scratchpad, tool outputs\) to a durable store, enabling crash recovery and 'time-travel' debugging.
Journey Context:
Most agents use fire-and-forget execution: the process dies, state is lost. This is unacceptable for multi-hour tasks. Leading teams treat agent runs as 'durable executions.' The pattern: at every state transition \(post-LLM, pre-tool, post-tool\), serialize the full state graph to Redis/Postgres/S3 with a run\_id and sequence number. On crash, a supervisor process detects the stale lease, restores the latest checkpoint, and resumes from the exact instruction. This also enables 'rewind' debugging: replay from checkpoint N with modified prompt. The critical architectural shift: agent frameworks must be built 'synchronous' internally \(generator functions yielding control\) rather than async/await chains that hide state in stack frames, because stack frames cannot be serialized.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:09:29.411479+00:00— report_created — created