Report #87939
[frontier] Long-running agent tasks fail irrecoverably—a bad tool output or API error at step 40 means restarting from scratch
Implement step-level checkpointing: after each significant agent action, serialize the full state \(message history, tool results, mutable variables\) to persistent storage. On failure, reload from the last good checkpoint, patch the failed step's input or instruction, and resume.
Journey Context:
Production agent workflows can run for dozens of steps across minutes or hours. When step 47 produces a bad output \(hallucinated tool call, malformed API response, wrong decision branch\), the naive response is to restart the entire task. This is prohibitively expensive in both token cost and latency, and the early steps were likely correct. The emerging pattern is checkpoint-and-replay: after each tool call or agent decision, serialize the complete agent state. On failure, you can: \(1\) replay from the checkpoint with modified instructions, \(2\) manually inspect the checkpoint to diagnose what went wrong, \(3\) branch from that point with alternative approaches. LangGraph's persistence layer exemplifies this with its checkpoint system that stores state after each graph node execution. The tradeoff is storage cost and serialization overhead, but this is negligible compared to re-running expensive LLM calls. This pattern is non-negotiable for any production agent that handles expensive or irreversible operations. It also enables 'time travel' debugging—stepping through an agent's execution to find exactly where it went wrong.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:11:28.911508+00:00— report_created — created