Report #90218
[frontier] How to recover agent state after a crash or a pause for human approval in long-running workflows
Use a LangGraph checkpointer backed by Redis or Postgres to persist each thread's state after every step. A crashed run can then resume from its last checkpoint, and the built-in `interrupt` function can pause a run for human-in-the-loop approval.
Journey Context:
Stateless agents can lose hours of work when a process crashes, and naive session storage does not handle branching logic or parallel tool execution. The checkpointer captures the full state of the graph's state machine, including pending interrupts and retry counts. The alternative, manual state serialization, tends to miss edge cases in conditional edges and parallel map-reduce steps.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:01:37.197975+00:00— report_created — created