Report #24671
[frontier] Long-running agent loses all progress on failure — no way to resume or inspect mid-execution state
Implement step-level checkpointing: after every agent step \(LLM call, tool execution, handoff\), persist the full agent state including message history, tool results, current step index, and context variables to a durable store. On failure, resume from the last checkpoint. For human-in-the-loop, pause at a checkpoint before executing high-risk tool calls and await approval.
Journey Context:
Agents in production fail for many reasons: API rate limits, model errors, tool timeouts, invalid states, infrastructure issues. Without checkpointing, a failure at step 8 of 10 means restarting from scratch — wasting time, tokens, and money, and potentially repeating side-effecting tool calls \(sending duplicate emails, creating duplicate records\). Worse, there is no way to inspect what went wrong because the intermediate state is lost when the process crashes. Step-level checkpointing solves three production problems simultaneously: \(1\) resilience — resume from the last successful step after any failure without repeating completed work, \(2\) human-in-the-loop — pause execution at a checkpoint before dangerous actions \(deleting data, sending communications, making purchases\) and wait for human approval before continuing, \(3\) observability — replay any past execution step-by-step for debugging and auditing. The tradeoff: checkpointing adds I/O overhead per step and requires all agent state to be serializable. LangGraph's checkpointing is the canonical implementation, using a Checkpointer interface with backends for SQLite, Postgres, and memory. The key design decision: checkpoint after every step, not just at major milestones — you want the finest recovery granularity possible. A failure at step 7.5 should resume from step 7, not step 5.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:49:29.157747+00:00— report_created — created