Report #55534
[frontier] Long-running agent task fails midway and must restart from the beginning, losing all progress
Implement state checkpointing at every agent step. After each step, persist the full agent state (messages, tool results, scratchpad, current node) to durable storage. On failure, resume from the last checkpoint rather than restarting. The same checkpoints also support human-in-the-loop pausing and time-travel debugging.
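The save-and-resume loop described above can be sketched with only the Python standard library. This is a minimal illustration, not LangGraph's API: the table layout and the names `open_store`, `save_checkpoint`, `load_latest`, and `run_agent` are hypothetical, and it assumes the agent state is a JSON-serializable dict and each step is a function that takes the state and returns the new state.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Create (or open) a checkpoint table keyed by run id and step number."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT, step INTEGER, state TEXT, PRIMARY KEY (run_id, step))"
    )
    return conn

def save_checkpoint(conn, run_id, step, state):
    """Persist the full state after a step and commit before moving on."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
        (run_id, step, json.dumps(state)),
    )
    conn.commit()

def load_latest(conn, run_id):
    """Return (step, state) for the newest checkpoint, or None if none exists."""
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE run_id = ? "
        "ORDER BY step DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else None

def run_agent(conn, run_id, steps, initial_state):
    """Run steps in order, skipping any that already have a checkpoint."""
    latest = load_latest(conn, run_id)
    if latest:
        start, state = latest[0] + 1, latest[1]
    else:
        start, state = 0, initial_state
    for i in range(start, len(steps)):
        state = steps[i](state)  # may raise; earlier checkpoints are kept
        save_checkpoint(conn, run_id, i, state)
    return state
```

Calling `run_agent` again with the same `run_id` after a failure picks up at the first step that has no checkpoint. For durability across process restarts, pass a file path to `open_store` instead of the in-memory default.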
Journey Context:
Production agents that run for 20+ steps (research, coding, data analysis) will eventually hit failures or interruptions: API rate limits, timeouts, LLM errors, or human approval gates. Without checkpointing, any of these discards all progress and forces a restart from step one, which is expensive and frustrating. Per-step checkpointing provides three capabilities: (1) fault tolerance: resume from the last good state; (2) human-in-the-loop: pause at an approval gate and resume after human input; (3) time-travel debugging: replay from any checkpoint to reproduce an issue. The key implementation detail is to checkpoint at the graph node level, not just the conversation level, because resuming requires the full execution state (including which node runs next), not just the message history. LangGraph's persistence layer provides this out of the box with multiple backend options (SQLite, Postgres, in-memory).
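To illustrate why node-level state matters for approval gates and time travel, here is a small sketch of a three-node graph. It is an assumption-laden toy, not LangGraph code: the node names (`plan`, `approve`, `act`), the `PauseForApproval` exception, the `approved` field, and the in-memory `history` list standing in for durable storage are all hypothetical.

```python
import copy

class PauseForApproval(Exception):
    """Raised by a node that needs human input before it can continue."""

def plan(state):
    state["scratchpad"] = "delete 3 stale files"
    return "approve"  # each node returns the name of the next node

def approve(state):
    if state.get("approved") is None:
        raise PauseForApproval("waiting for a human decision")
    return "act" if state["approved"] else None

def act(state):
    state["tool_results"].append("deleted 3 files")
    return None  # None means the graph is finished

NODES = {"plan": plan, "approve": approve, "act": act}

def run(history, state):
    """Execute nodes starting at state['current_node'], checkpointing each one."""
    state = copy.deepcopy(state)  # never mutate a stored checkpoint
    while state["current_node"] is not None:
        node = NODES[state["current_node"]]
        state["current_node"] = node(state)
        # The snapshot records which node runs next, not just the messages.
        history.append(copy.deepcopy(state))
    return state
```

When `approve` raises, the last entry in `history` still has `current_node` set to `"approve"`, so passing `{**history[-1], "approved": True}` back to `run` resumes at the gate without re-running `plan`. Passing an earlier snapshot with different input (for example `"approved": False`) and a fresh history list replays from that point, which is the time-travel case.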
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:42:28.960535+00:00— report_created — created