Report #38261
[frontier] Agent workflow fails mid-execution and must restart from the beginning, losing all progress
Implement checkpointing at every agent decision boundary. Serialize the full agent state (message history, tool results, pending steps) to persistent storage after each step completes. On failure, resume from the last checkpoint rather than restarting.
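A minimal sketch of the checkpoint-and-resume loop described above, using only the Python standard library. The names `AgentState`, `save_checkpoint`, `load_checkpoint`, and `run_workflow` are hypothetical and chosen for illustration; a JSON file stands in for whatever persistent storage is actually used.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field, asdict


@dataclass
class AgentState:
    # Hypothetical state shape: the three pieces the report says to persist.
    messages: list = field(default_factory=list)       # full message history
    tool_results: list = field(default_factory=list)   # raw tool outputs
    pending_steps: list = field(default_factory=list)  # steps not yet executed


def save_checkpoint(state, path):
    """Write the state atomically so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(state), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return AgentState(**json.load(f))


def run_workflow(steps, execute_step, path):
    """Run steps in order, checkpointing after each one completes."""
    # Resume from the last checkpoint if there is one; otherwise start fresh.
    state = load_checkpoint(path) or AgentState(pending_steps=list(steps))
    while state.pending_steps:
        step = state.pending_steps[0]
        result = execute_step(step)  # may raise; nothing is persisted if it does
        state.messages.append({"role": "tool", "content": f"{step} -> {result}"})
        state.tool_results.append(result)
        state.pending_steps.pop(0)
        save_checkpoint(state, path)  # persist only after the step has finished
    return state
```

Calling `run_workflow` again with the same `path` after a failure picks up at the first step that had not completed, so earlier steps are not executed a second time.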
Journey Context:
Production agent workflows often run for minutes or hours across many steps. Without checkpointing, any failure (API timeout, rate limit, infrastructure error) means starting over, which wastes tokens, time, and money. The naive approach of wrapping the entire workflow in a retry loop does not work, because the agent's context window can fill up across retries and side-effecting operations (sending emails, writing files) are repeated. LangGraph's persistence layer formalizes this: after each graph node executes, the full state is checkpointed to a configurable backend (SQLite, Postgres, Redis). On resume, the agent continues from the exact decision point with full state. The critical detail: checkpoint after receiving tool results, not before calling tools, so that side-effecting operations are not re-executed on resume. Also checkpoint the raw state (messages, tool outputs) rather than a summary, because summaries lose information needed for correct resumption. The tradeoff: checkpointing adds latency (a storage write after each step) and storage cost, but this is usually small compared with the cost of re-running failed workflows from scratch.
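The ordering rule above (persist after the tool result arrives, and persist it raw) can be sketched as a small result log keyed by tool call id. This is an illustrative assumption, not LangGraph's API: `ToolResultLog` and `call_tool_once` are hypothetical names, and the standard-library `sqlite3` module stands in for the storage backend.

```python
import json
import sqlite3


class ToolResultLog:
    """Hypothetical helper: persists each raw tool result under its call id."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tool_results ("
            "run_id TEXT, call_id TEXT, result TEXT, "
            "PRIMARY KEY (run_id, call_id))"
        )
        self.conn.commit()

    def get(self, run_id, call_id):
        """Return the stored record for this call, or None if it never finished."""
        row = self.conn.execute(
            "SELECT result FROM tool_results WHERE run_id = ? AND call_id = ?",
            (run_id, call_id),
        ).fetchone()
        return None if row is None else json.loads(row[0])

    def put(self, run_id, call_id, result):
        """Store the raw result and commit immediately."""
        self.conn.execute(
            "INSERT OR REPLACE INTO tool_results VALUES (?, ?, ?)",
            (run_id, call_id, json.dumps(result)),
        )
        self.conn.commit()


def call_tool_once(log, run_id, call_id, tool, *args):
    """Execute a side-effecting tool at most once per (run_id, call_id)."""
    recorded = log.get(run_id, call_id)
    if recorded is not None:
        # The call completed before the failure: reuse the raw result.
        return recorded["value"]
    value = tool(*args)  # the side effect happens here
    # Checkpoint AFTER the result is in hand, wrapped so that a None
    # result is still distinguishable from "never executed".
    log.put(run_id, call_id, {"value": value})
    return value
```

On resume, a call whose result was already recorded returns the stored value instead of sending the email or writing the file again. A crash between the tool returning and the commit can still cause one repeat, so tools with costly side effects may also need their own idempotency keys.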
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:42:01.315575+00:00 - report_created - created