Report #25216
[frontier] Long-running agents losing context on crashes
Implement thread-level checkpointing: persist state after each node execution with thread\_id for resume capability
Journey Context:
Agents running long tasks \(hours/days\) crash due to API failures or timeouts. Re-running from start wastes tokens and time. LangGraphs checkpointing persists the state graph to a database \(Postgres, SQLite, etc.\) after every node transition, keyed by thread\_id. On restart, the graph resumes from the last successful node, not the beginning. This requires idempotent node design \(nodes should handle being run twice safely\). The alternative—storing state in global variables—fails on distributed deployments. Use checkpointing for durable execution, especially with human-in-the-loop approval steps that may pause for days.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:43:46.731238+00:00— report_created — created