Report #42014
[frontier] Long-running agent tasks lose all progress on timeout, crash, or human-interrupt — must restart from scratch
Implement checkpointing at every graph node: persist the full agent state (including intermediate results and the next node to execute) to durable storage. On restart, resume from the last checkpoint rather than re-executing from scratch.
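A minimal sketch of that checkpoint-and-resume loop, using only the Python standard library. This is not LangGraph's API: `CheckpointStore`, `run`, the `checkpoints` table layout, and the use of JSON for serialization are all hypothetical names and choices made for illustration.

```python
import json
import sqlite3
from typing import Callable, Dict, Optional, Tuple

State = dict
# A node takes the state and returns (new_state, next_node_name); None means done.
Node = Callable[[State], Tuple[State, Optional[str]]]


class CheckpointStore:
    """Hypothetical store: one row per thread with the latest state and next node."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT PRIMARY KEY, next_node TEXT, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, thread_id: str, next_node: Optional[str], state: State) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (thread_id, next_node, json.dumps(state)),
            )

    def load(self, thread_id: str) -> Optional[Tuple[Optional[str], State]]:
        row = self.conn.execute(
            "SELECT next_node, state FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))


def run(
    nodes: Dict[str, Node],
    entry: str,
    thread_id: str,
    store: CheckpointStore,
    initial_state: State,
) -> State:
    """Run the graph, resuming from the thread's last checkpoint if one exists."""
    saved = store.load(thread_id)
    next_node, state = saved if saved is not None else (entry, dict(initial_state))
    while next_node is not None:
        state, next_node = nodes[next_node](state)
        # Checkpoint only after the node succeeds, so a crash re-runs that node alone.
        store.save(thread_id, next_node, state)
    return state
```

Calling `run` again with the same `thread_id` after a failure skips every node that already completed. The node that was in flight when the failure happened is executed again, so any side effect inside a single node still needs to be safe to repeat.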
Journey Context:
Production agents often run tasks that take minutes or hours, especially when human-in-the-loop approval is required or when a task involves multiple external API calls. Without checkpointing, any interruption (timeout, crash, user closing the session, rate limit) loses all progress and forces a full re-execution, which may repeat side effects (duplicate API calls, duplicate file writes). LangGraph's checkpointing pattern persists the full state of the StateGraph at each node execution to a configurable backing store (SQLite, Postgres, or in-memory; the in-memory store does not survive a process restart). When the agent is re-instantiated with the same thread ID, it loads the latest checkpoint and resumes from the next node. This enables: (1) fault tolerance: resume after crashes; (2) human-in-the-loop: the agent pauses at a checkpoint, waits for approval, then continues in a new session; (3) time-travel debugging: replay from any checkpoint to diagnose issues; (4) branching: fork execution from a checkpoint to try alternative approaches. Tradeoff: checkpointing adds I/O overhead at each step and requires serializable state, so the state schema must be designed with serialization in mind (no closures, DB connections, or file handles in state). Alternative: re-run from scratch, which is unsafe and expensive for tasks with side effects.
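The serialization constraint can be illustrated with a small sketch. `AgentState`, `dumps`, `loads`, and `is_checkpointable` are hypothetical names, and JSON stands in for whatever serializer the real checkpointer uses, which may accept more types than JSON does.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List, Optional


@dataclass
class AgentState:
    """Hypothetical state schema: plain data only, so it round-trips through JSON."""

    task: str
    next_node: Optional[str] = None
    results: List[str] = field(default_factory=list)
    # Keep a reference to a resource (here a path), never the open handle itself.
    output_path: Optional[str] = None


def dumps(state: AgentState) -> str:
    """Serialize the state to a JSON string for storage in a checkpoint."""
    return json.dumps(asdict(state))


def loads(payload: str) -> AgentState:
    """Rebuild the state from a stored JSON string."""
    return AgentState(**json.loads(payload))


def is_checkpointable(value: Any) -> bool:
    """Return True if the value can be stored under the JSON assumption above."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False
```

Under this assumption a closure or an open connection fails the check, while a path or connection name passes. The node that needs the resource reopens it from that reference after a resume.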
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:59:34.842531+00:00— report_created — created