Report #96809
[frontier] Long-running agent fails mid-execution and must restart from scratch, losing expensive LLM calls and tool results
Implement checkpointing at every tool-call boundary. After each tool result is received, persist: (1) the full conversation state, (2) the agent's current phase or step, and (3) any accumulated results. On recovery, reload the last checkpoint and resume execution from there. Use LangGraph's checkpointing or implement your own with a durable store keyed by thread ID.
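A minimal sketch of the "implement your own" option, assuming SQLite as the durable store and JSON-serializable state. The `CheckpointStore` class, its table layout, and the field names are hypothetical and chosen for illustration; they are not part of LangGraph or any other library.

```python
import json
import sqlite3
from typing import Optional


class CheckpointStore:
    """Hypothetical store that keeps the latest checkpoint per thread ID."""

    def __init__(self, path: str = "checkpoints.db") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT PRIMARY KEY, "
            "step INTEGER NOT NULL, "
            "state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, thread_id: str, step: int, messages: list, results: dict) -> None:
        # Conversation state and accumulated results go into one row, so a
        # checkpoint is either fully written or not written at all.
        state = json.dumps({"messages": messages, "results": results})
        with self.conn:  # commits on success, rolls back on exception
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints (thread_id, step, state) "
                "VALUES (?, ?, ?)",
                (thread_id, step, state),
            )

    def load(self, thread_id: str) -> Optional[dict]:
        # Returns None when the thread has no checkpoint yet (fresh start).
        row = self.conn.execute(
            "SELECT step, state FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        if row is None:
            return None
        step, state = row
        return {"step": step, **json.loads(state)}
```

Calling `save` after every tool result and `load` once at startup covers the three items listed above. Pass a file path for real use; `":memory:"` is handy for tests but is not durable.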
Journey Context:
Production agents inevitably fail: API timeouts, rate limits, infrastructure issues, context overflows. Without checkpointing, a failure at step 8 of 10 means redoing everything: re-running expensive LLM calls, re-executing tool invocations, and re-computing results. Naively retrying the whole run is expensive and unreliable, because LLM output is non-deterministic and the retry may take a different path and still fail. Checkpointing at tool-call boundaries is the sweet spot because (1) tool calls are natural transaction boundaries, with consistent state before and after, (2) tool results are often the most expensive part in time and money, and (3) the conversation state is well-defined at these points. Do not checkpoint mid-LLM-generation; that is not a clean boundary. The tradeoff is checkpoint storage cost and write latency, but for any agent workflow that takes more than 30 seconds or costs more than a few cents, checkpointing pays for itself on the first failure. LangGraph's checkpointers (MemorySaver, SqliteSaver, AsyncPostgresSaver) implement this pattern natively: when a graph is compiled with a checkpointer, each graph step is checkpointed automatically. Note that MemorySaver keeps checkpoints in process memory only, so it does not survive a process crash; use a database-backed saver when recovery matters. For custom implementations, the pattern is: serialize conversation + metadata after each tool result, write to a durable store, and on restart, deserialize and resume from the last checkpoint.
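The custom pattern described above (serialize after each tool result, write durably, resume on restart) might look like the following sketch. It assumes a JSON file as the store and models the agent as a fixed list of tool callables; `save_checkpoint`, `load_checkpoint`, and `run_agent` are hypothetical names, and a real agent would interleave LLM calls between the tool calls.

```python
import json
import os
import tempfile
from typing import Callable, List


def save_checkpoint(path: str, state: dict) -> None:
    # Write to a temp file in the same directory, then rename it over the
    # target, so a crash mid-write cannot leave a half-written checkpoint.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    # No file means a fresh run: start at step 0 with empty state.
    if not os.path.exists(path):
        return {"step": 0, "messages": [], "results": []}
    with open(path) as f:
        return json.load(f)


def run_agent(path: str, tools: List[Callable[[], str]]) -> list:
    state = load_checkpoint(path)
    # Resume at the first step that has not been checkpointed yet.
    for i in range(state["step"], len(tools)):
        result = tools[i]()  # may raise (timeout, rate limit, ...)
        state["messages"].append({"role": "tool", "content": result})
        state["results"].append(result)
        state["step"] = i + 1
        save_checkpoint(path, state)  # checkpoint at the tool-call boundary
    return state["results"]
```

If a tool raises, the exception propagates and the checkpoint still holds every earlier result; calling `run_agent` again with the same path skips the completed steps. One caveat with this sketch: a crash after a tool returns but before `save_checkpoint` completes will re-run that tool on resume, so tools with side effects should be idempotent.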
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:04:45.189737+00:00— report_created — created