Report #64071
[frontier] Long-horizon agents fail catastrophically around step 50 of 100 and lose all progress: they cannot roll back to an intermediate stable state, so every failure forces a full restart
Implement hierarchical checkpointing using graph persistence: save state at subgoal boundaries (not at every step), enable 'time travel' to fork execution from any previous node, and use 'interrupt' nodes for human-in-the-loop recovery instead of failing the entire run
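The 'interrupt' idea above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangGraph's API: `NeedsHuman`, `Run`, `advance`, and `resume` are hypothetical names invented for this example. A step that needs a human decision pauses the run at that step instead of failing it, and resuming continues from the paused step without re-running earlier ones.

```python
from dataclasses import dataclass, field
from typing import Optional


class NeedsHuman(Exception):
    """Hypothetical signal: a step cannot proceed without a human decision."""

    def __init__(self, question: str):
        super().__init__(question)
        self.question = question


@dataclass
class Run:
    steps: list                      # ordered (name, function) pairs
    state: dict = field(default_factory=dict)
    next_index: int = 0              # index of the next step to execute
    pending_question: Optional[str] = None

    def advance(self) -> str:
        """Run steps in order; pause (not fail) when a step asks for a human."""
        while self.next_index < len(self.steps):
            _name, fn = self.steps[self.next_index]
            try:
                # Each step gets a copy, so a paused step leaves state untouched.
                self.state = fn(dict(self.state))
            except NeedsHuman as pause:
                self.pending_question = pause.question
                return "interrupted"
            self.next_index += 1
        return "done"

    def resume(self, answer: str) -> str:
        """Record the human's answer and continue from the paused step."""
        self.state["human_answer"] = answer
        self.pending_question = None
        return self.advance()


def research(state: dict) -> dict:
    state["research_runs"] = state.get("research_runs", 0) + 1
    state["quote_usd"] = 10_000
    return state


def purchase(state: dict) -> dict:
    if "human_answer" not in state:
        raise NeedsHuman(f"approve this ${state['quote_usd']} purchase?")
    state["purchased"] = state["human_answer"] == "approve"
    return state


run = Run(steps=[("research", research), ("purchase", purchase)])
first_status = run.advance()          # pauses at the purchase step
question = run.pending_question
final_status = run.resume("approve")  # continues; research is not re-run
```

In a real deployment the paused `Run` would be persisted (for example to a database) so that the human can answer hours later; that persistence is omitted here.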
Journey Context:
Naive agent loops either save no state or append every action to a linear log. When step 50 fails (e.g., an API rate limit), you must restart from step 1 or manually edit the log. 'Retry with exponential backoff' does not help when the failure requires human input (e.g., 'approve this $10K purchase'). LangGraph's persistence layer treats the agent as a state-machine graph in which each node boundary is a checkpoint. The key insight: checkpoint at subgoal completion (e.g., 'research phase done'), not at every LLM token. This allows 'partial rollback': if the 'booking' phase fails, roll back to the 'research complete' state, modify parameters, and retry without re-running research. Tradeoff: the agent must be modeled as a graph (nodes and edges) rather than free-form Python, which adds upfront design cost. That cost is what lets the pattern scale from 10-step tasks to 100-step production workflows.
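The partial rollback described above can be sketched with the standard library alone. This is an illustrative sketch, not LangGraph's persistence API: `Checkpoint`, `CheckpointStore`, `fork`, and the `research`/`booking` functions are hypothetical names, and the in-memory list stands in for a durable store. State is saved once per completed subgoal, and a failed booking is retried from the 'research complete' checkpoint with a modified parameter.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    id: int
    parent: Optional[int]   # id of the checkpoint this one continues from
    subgoal: str            # label of the subgoal that just completed
    state: dict


class CheckpointStore:
    """In-memory stand-in for a durable checkpoint store (assumption)."""

    def __init__(self) -> None:
        self._items: list = []

    def save(self, subgoal: str, state: dict, parent: Optional[int] = None) -> Checkpoint:
        # Deep-copy so later mutations cannot corrupt the saved snapshot.
        cp = Checkpoint(len(self._items), parent, subgoal, copy.deepcopy(state))
        self._items.append(cp)
        return cp

    def fork(self, cp: Checkpoint, **overrides) -> dict:
        """Return a fresh working state based on cp, with parameters changed."""
        state = copy.deepcopy(cp.state)
        state.update(overrides)
        return state


calls = {"research": 0}  # counts how often the expensive phase runs


def research(state: dict) -> dict:
    calls["research"] += 1
    state["options"] = ["flight A", "flight B"]
    return state


def booking(state: dict) -> dict:
    if state["choice"] not in state["options"]:
        raise RuntimeError("booking failed: choice unavailable")
    state["booked"] = state["choice"]
    return state


store = CheckpointStore()
state = research({"choice": "flight C"})
research_cp = store.save("research complete", state)   # subgoal boundary

try:
    state = booking(state)
except RuntimeError:
    # Partial rollback: restart from the research checkpoint with a new
    # parameter instead of re-running research from scratch.
    state = booking(store.fork(research_cp, choice="flight A"))

booking_cp = store.save("booking complete", state, parent=research_cp.id)
```

Because each checkpoint records its parent, several forks from the same checkpoint form a tree rather than a linear log, which is what makes 'time travel' to an earlier node possible.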
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:01:40.396780+00:00— report_created — created