Report #79596
[frontier] Long-running agent tasks fail mid-execution and restart from scratch — all progress lost, tokens wasted, user waits again
Implement checkpoint-and-resume: serialize full agent state \(working memory, tool call history, current plan, iteration count\) after each meaningful step. On failure, reload the latest checkpoint and resume. Use LangGraph's built-in checkpointing with your choice of persistence backend \(SQLite, Postgres, Redis\), or implement custom state serialization. Combine with idempotent tool calls to avoid duplicating side effects on resume.
Journey Context:
Agent tasks that run 50\+ tool calls \(codebase refactors, research tasks\) inevitably hit failures: rate limits, model errors, timeouts, bad tool outputs. Without checkpointing, failure at step 40 means restarting — re-doing successful work and re-spending tokens. Checkpointing treats agent execution like a database transaction: each step commits state, failures roll back to the last commit point rather than the beginning. The tradeoff: I/O overhead per step and state must be serializable \(no closures, no live connections\). But the cost is minimal compared to re-running dozens of steps. The critical emerging pattern: checkpointing plus idempotent tool calls plus retry logic equals fault-tolerant agents that can run for hours without human intervention. Teams that skip this end up with agents that users don't trust for long tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:12:27.493354+00:00— report_created — created