Report #64341
[frontier] Long-running agent task fails at step 15 of 20, requiring a full restart from step 1 with duplicated cost and time
Implement step-level checkpointing: after each agent step (tool call, decision, state transition), persist the full agent state (complete message list, tool results, internal state variables) to durable storage. On failure, reload the last successful checkpoint and retry only the step that failed. Use LangGraph's built-in persistence layer or implement an equivalent checkpoint store.
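For the "equivalent checkpoint store" option, the following is a minimal sketch using only the Python standard library, not LangGraph's own API. The names `CheckpointStore`, `run_task`, and the `checkpoints` table are hypothetical, and the sketch assumes the agent state is a JSON-serializable dict and that each step is a function from state to new state.

```python
import json
import sqlite3


class CheckpointStore:
    """Hypothetical step-level checkpoint store backed by SQLite."""

    def __init__(self, path=":memory:"):
        # Pass a file path for durable storage; ":memory:" is for demos only.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "task_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (task_id, step))"
        )

    def save(self, task_id, step, state):
        # json.dumps raises TypeError if the state is not serializable.
        payload = json.dumps(state)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (task_id, step, payload),
            )

    def latest(self, task_id):
        """Return (last completed step, state), or (0, None) if none exists."""
        row = self.conn.execute(
            "SELECT step, state FROM checkpoints WHERE task_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (task_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (0, None)


def run_task(store, task_id, steps, initial_state):
    """Run steps in order, resuming after the last checkpointed step."""
    done, state = store.latest(task_id)
    if state is None:
        state = initial_state
    for index in range(done, len(steps)):
        state = steps[index](state)  # may raise; nothing is saved if it does
        store.save(task_id, index + 1, state)
    return state
```

Calling `run_task` again with the same `task_id` after a crash skips every step that already has a checkpoint and resumes at the one that failed. A production store would likely also need to handle concurrent writers and steps whose side effects partially completed before the failure.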
Journey Context:
Long-running agent tasks (multi-step research, complex code generation, multi-file refactoring) are inherently fragile: any API error, rate limit, timeout, or unexpected tool result can crash the agent. Without checkpointing, a failure at step 15 means re-running steps 1-14, which wastes time and money and repeats side effects that already happened (API calls already made, files already written). With checkpointing, you reload the exact state saved after step 14 and retry step 15. LangGraph supports this natively: when a graph is compiled with a checkpointer, its state is saved after every graph execution step to a configurable persistence backend (SQLite, Postgres, in-memory). The in-memory backend does not survive a process crash, so use a database-backed one when durability matters. The tradeoff: checkpointing adds I/O overhead (serializing and writing state after each step) and requires that your agent state is fully serializable, so it cannot hold open connections or closures. As a rule of thumb, for any agent task that runs longer than 30 seconds or costs more than $0.10, checkpointing pays for itself the first time it prevents a restart. An underappreciated benefit: checkpointing enables time-travel debugging. You can inspect the agent's state at any historical step to understand exactly why it made a decision, which is invaluable for post-mortems and agent improvement.
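The time-travel idea can be illustrated with a small standard-library sketch. It is not LangGraph's API: the helper names `open_history`, `record`, `state_at`, and `changed_keys` and the `checkpoints` table are hypothetical, and the state is assumed to be a JSON-serializable dict saved once per step.

```python
import json
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) a hypothetical per-step checkpoint history."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "task_id TEXT, step INTEGER, state TEXT, "
        "PRIMARY KEY (task_id, step))"
    )
    return conn


def record(conn, task_id, step, state):
    """Persist the state as it was after the given step."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (task_id, step, json.dumps(state)),
        )


def state_at(conn, task_id, step):
    """Load the state exactly as it was after the given step."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE task_id = ? AND step = ?",
        (task_id, step),
    ).fetchone()
    if row is None:
        raise KeyError("no checkpoint for %s at step %d" % (task_id, step))
    return json.loads(row[0])


def changed_keys(conn, task_id, step):
    """Return top-level keys whose values differ between step - 1 and step."""
    before = state_at(conn, task_id, step - 1)
    after = state_at(conn, task_id, step)
    return sorted(
        key
        for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    )
```

In a post-mortem, `changed_keys(conn, task_id, 15)` would show which parts of the state step 15 touched, and `state_at` returns the full snapshot for closer inspection. The serializability constraint shows up here too: `json.dumps` raises `TypeError` if the state holds a connection object or a function.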
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:28:59.022228+00:00— report_created — created