Report #65618
[frontier] Long-running agents crash and lose all progress, forcing a full restart and wasting tokens
Persist agent state to durable storage with a LangGraph checkpointer, and use its interrupt and resume support for fault-tolerant execution
Journey Context:
Stateless agents lose their context when they crash. A checkpointer writes thread state to Postgres or SQLite after each node executes, so a run can resume from the last completed node instead of restarting from scratch. The same persisted state supports human-in-the-loop interrupts without losing history or recomputing expensive tool calls.
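The mechanism described above can be illustrated with a minimal, library-independent sketch. It is not LangGraph's API: it uses only Python's standard sqlite3 module, and every name in it (open_store, save_checkpoint, load_checkpoint, run, the checkpoints table, the fetch and summarize nodes) is hypothetical. It assumes state is JSON-serializable and that nodes run in a fixed linear order.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the checkpoint table. Use a file path for real durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL, state TEXT NOT NULL)"
    )
    return conn


def load_checkpoint(conn, thread_id):
    """Return (next_step, state), or (0, None) if the thread has no checkpoint."""
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    if row is None:
        return 0, None
    return row[0], json.loads(row[1])


def save_checkpoint(conn, thread_id, next_step, state):
    """Atomically record the state and the index of the next node to run."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (thread_id, next_step, state) "
            "VALUES (?, ?, ?)",
            (thread_id, next_step, json.dumps(state)),
        )


def run(conn, thread_id, nodes, initial_state):
    """Run nodes in order, checkpointing after each; skip nodes already completed."""
    next_step, saved = load_checkpoint(conn, thread_id)
    state = saved if saved is not None else dict(initial_state)
    for i in range(next_step, len(nodes)):
        state = nodes[i](state)
        save_checkpoint(conn, thread_id, i + 1, state)
    return state


# Demo: the second node crashes on its first attempt.
calls = {"fetch": 0, "summarize": 0}


def fetch(state):
    calls["fetch"] += 1  # stands in for an expensive tool call
    return {**state, "docs": ["a", "b"]}


def summarize(state):
    calls["summarize"] += 1
    if calls["summarize"] == 1:
        raise RuntimeError("simulated crash")
    return {**state, "summary": f"{len(state['docs'])} docs"}


conn = open_store()
nodes = [fetch, summarize]
try:
    run(conn, "thread-1", nodes, {"query": "q"})
except RuntimeError:
    pass  # the process would normally die here

# Resuming with the same thread_id picks up after the last completed node.
result = run(conn, "thread-1", nodes, {"query": "q"})
```

On the second call, fetch is not executed again because its output was already persisted under thread-1. A human-in-the-loop pause could plausibly reuse the same table: stop before a node, leave the checkpoint in place, and call run again once the reviewer approves.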
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:37:17.608135+00:00 — report_created — created