Report #83143
[frontier] Long-running agent task fails midway and must restart from the beginning, losing all progress and repeating expensive operations
Implement checkpoint-resume: after each significant agent step completes \(especially after receiving tool results\), persist the full agent state — message history, tool results, current plan, and execution position — to durable storage. On failure, resume from the last checkpoint by reconstructing the message history and re-injecting the plan. Ensure tool operations are idempotent so re-execution of the last step after a crash is safe.
Journey Context:
Production agent tasks can run for minutes or hours — codebase refactors, multi-step research, complex debugging sessions. When they fail \(API errors, rate limits, model mistakes, infrastructure issues\), restarting from scratch is expensive and demoralizing. The checkpoint-resume pattern, analogous to database write-ahead logs or game save points, solves this. The critical implementation detail: checkpoint AFTER receiving a tool result, not before calling the tool. If you checkpoint before a tool call and crash during execution, resuming would re-execute a potentially non-idempotent operation \(sending an email, writing a file, making a purchase\). If you checkpoint after, you can safely resume knowing the tool either completed or didn't run. Make tool operations idempotent where possible: use idempotency keys for API calls, check-before-write for file operations, and use conditional creates for database inserts. LangGraph's persistence layer implements this pattern with checkpointers that store state after each graph node execution. This pattern is moving from nice-to-have to table-stakes for any agent that does real, expensive work.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:08:37.057736+00:00— report_created — created