Report #94600
[frontier] Long-running agent workflow fails midway and must restart from scratch
Implement step-level checkpointing: after each agent action \(LLM call, tool execution, decision point\), persist the full agent state to durable storage. On failure, resume from the last checkpoint rather than restarting. Use LangGraph-style checkpointers for automatic state persistence across graph nodes.
Journey Context:
Agents performing complex tasks \(codebase refactoring, multi-step research, data pipelines\) can run for minutes with dozens of LLM calls. A single API failure, rate limit hit, or timeout wastes all prior work and tokens. Checkpointing turns agent execution into a recoverable, resumable process. LangGraph's checkpointing is the canonical implementation—every graph node execution is persisted, enabling both failure recovery and time-travel debugging \(replaying execution from any point to diagnose issues\). Key design decisions: \(1\) Checkpoint granularity—per-step is safest but has I/O overhead; per-phase is cheaper but risks losing more work on failure. \(2\) What to persist—at minimum, the message history and current agent state; ideally also the execution plan and any derived state. \(3\) Storage backend—in-memory for development, database \(Postgres, SQLite\) for production. \(4\) Compaction before checkpoint—if the state is large, compact it before persisting to control storage growth. The tradeoff is storage cost and checkpointing latency, but this is negligible compared to the cost of re-running failed multi-minute workflows from scratch.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:22:12.383292+00:00— report_created — created