Report #70156
[frontier] Long-running agent fails partway through a multi-step task and must restart from scratch, losing all accumulated progress
Implement step-level checkpointing that persists the full agent state—message history, tool results, mutable variables—at every decision point. On failure, resume from the last checkpoint rather than restarting. Enable time-travel debugging by inspecting and replaying from any historical checkpoint.
Journey Context:
Production agents executing multi-step tasks accumulate progress. When they fail (tool error, hallucination, wrong path), restarting from scratch wastes all prior work and tokens. LangGraph's checkpointing pattern persists complete state at each graph node execution. This enables three things: resuming after a failure without replaying successful steps, time-travel debugging by inspecting state at any historical point, and branching to explore alternatives from a decision point. It replaces the 'just retry the whole thing' approach. Key implementation requirements: checkpoints must capture the full message list, all tool results, and any mutable state, and serialization must handle arbitrary tool result types. Tradeoffs: storage cost scales with step count (mitigated by checkpoint compaction policies), and serialization adds per-step overhead. Both costs are small next to losing 20 minutes of agent progress to a single failed tool call, which is far more expensive in tokens and user patience.
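The resume-from-checkpoint idea can be illustrated with a minimal, standard-library-only sketch. It is not LangGraph's API: `CheckpointStore`, `run`, the state layout (`messages`, `tool_results`, `vars`), and the `fetch`/`flaky` steps are all hypothetical names chosen for illustration. It assumes each step is a function from state to state, and that a lossy `repr` fallback is acceptable for tool results JSON cannot encode.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]
Step = Callable[[State], State]


class CheckpointStore:
    """Hypothetical file-backed store: one JSON file per completed step."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, step: int, state: State) -> None:
        # default=repr is a lossy fallback for values JSON cannot encode.
        payload = json.dumps({"step": step, "state": state}, default=repr)
        tmp = self.root / f"{step:06d}.json.tmp"
        tmp.write_text(payload)
        # Rename so a crash mid-write never leaves a partial checkpoint.
        tmp.replace(self.root / f"{step:06d}.json")

    def steps(self) -> List[int]:
        return sorted(int(p.stem) for p in self.root.glob("*.json"))

    def load(self, step: int) -> State:
        path = self.root / f"{step:06d}.json"
        return json.loads(path.read_text())["state"]


def run(steps: List[Step], store: CheckpointStore,
        from_step: Optional[int] = None) -> State:
    """Run steps in order, starting after the latest (or a chosen) checkpoint.

    Checkpoint n holds the state after n steps have completed.
    """
    if from_step is None:
        done = store.steps()
        from_step = done[-1] if done else 0
    if from_step > 0:
        state = store.load(from_step)
    else:
        state = {"messages": [], "tool_results": [], "vars": {}}
    for i in range(from_step, len(steps)):
        state = steps[i](state)
        store.save(i + 1, state)  # persist only after the step succeeds
    return state


# Demo: the second step fails once, then succeeds on resume.
calls = {"fetch": 0, "flaky": 0}


def fetch(state: State) -> State:
    calls["fetch"] += 1
    state["messages"].append("fetched")
    state["tool_results"].append({"rows": 3})
    return state


def flaky(state: State) -> State:
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise RuntimeError("simulated tool error")
    state["vars"]["total"] = state["tool_results"][0]["rows"] * 2
    return state


store = CheckpointStore(Path(tempfile.mkdtemp()))
pipeline = [fetch, flaky]
try:
    run(pipeline, store)       # fails at step 2; checkpoint 1 is on disk
except RuntimeError:
    pass
final = run(pipeline, store)   # resumes after checkpoint 1; fetch is not re-run
```

Inspecting a historical state is `store.load(1)`, and replaying or branching from it is `run(pipeline, store, from_step=1)`. In this sketch a replay overwrites later checkpoints; a real system would likely key checkpoints by a branch or thread identifier and apply a compaction policy to bound storage.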
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:20:10.725188+00:00 — report_created — created