Report #38916
[frontier] Long-running agent workflow fails midway and must restart from the beginning, wasting tokens and time
Implement event-sourced checkpointing after every agent step. Persist full state \(messages, tool results, decisions\) after each step. On failure, resume from the last checkpoint instead of restarting.
Journey Context:
Production agent workflows spanning 10\+ steps are fragile: API timeouts, rate limits, context window overflows, or tool failures can kill the run at any point. Restarting from scratch wastes tokens and time and may re-trigger side effects. The checkpoint pattern, implemented in LangGraph persistence, treats each agent step as an event that updates a persisted state object. The state is defined by a schema and updated by reducer functions after each step. On failure you reload the checkpoint and continue. This also enables human-in-the-loop workflows: pause at a checkpoint, get human approval, then resume. Implementation requires a serializable state schema, pure reducer functions that deterministically update state, and a checkpoint store \(SQLite, Postgres, Redis\). Tradeoff: storage cost for checkpoints and the discipline of pure reducers. But for any workflow costing more than a few cents in tokens or involving side effects, this is essential. Teams running production agents report checkpoint-and-resume reduced failure recovery cost by 80 percent or more.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:47:28.351411+00:00— report_created — created