Report #68533
[frontier] Long-running agents crash and lose progress, and cannot be interrupted for human approval of dangerous actions
Implement checkpoint-based persistence: save the full agent state (messages, next node) to a durable database after every step, enabling crash recovery, time-travel debugging, and human-in-the-loop interrupts.
Journey Context:
Production agents must survive restarts and allow human oversight. LangGraph's checkpointer saves the graph state to a database such as Postgres or SQLite after each node execution. This enables: 1) crash recovery (resume from the last completed step), 2) 'approve this tool call' interruptions (pause, notify a human, resume), 3) time-travel debugging (replay from an earlier checkpoint, e.g. step 3). This is becoming the standard for 'serious' agent deployments, in contrast to stateless serverless functions that lose state on timeout.
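The three capabilities above can be illustrated with a minimal standard-library sketch. It is not LangGraph's API: `SqliteCheckpointer`, `run`, `NODES`, and `INTERRUPT_BEFORE` are hypothetical names chosen for illustration, and the state is assumed to be a JSON-serializable dict with `messages` and `next` keys. It only shows the pattern: write a snapshot after every node, stop before gated nodes until approval arrives, and reload any earlier step on demand.

```python
import json
import sqlite3


class SqliteCheckpointer:
    """Hypothetical minimal checkpointer (illustrative, not LangGraph's class)."""

    def __init__(self, path=":memory:"):
        # Use a file path instead of ":memory:" to survive process restarts.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def save(self, thread_id, step, state):
        # Serialize a snapshot so later in-place mutation cannot alter it.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, step, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, thread_id, step=None):
        # step=None returns the latest checkpoint; a number returns that step.
        if step is None:
            row = self.conn.execute(
                "SELECT step, state FROM checkpoints WHERE thread_id = ? "
                "ORDER BY step DESC LIMIT 1",
                (thread_id,),
            ).fetchone()
        else:
            row = self.conn.execute(
                "SELECT step, state FROM checkpoints "
                "WHERE thread_id = ? AND step = ?",
                (thread_id, step),
            ).fetchone()
        return (row[0], json.loads(row[1])) if row else None


def plan(state):
    state["messages"].append("plan: delete temp files")
    state["next"] = "dangerous_tool"
    return state


def dangerous_tool(state):
    state["messages"].append("tool: files deleted")
    state["next"] = None
    return state


NODES = {"plan": plan, "dangerous_tool": dangerous_tool}
INTERRUPT_BEFORE = {"dangerous_tool"}  # nodes that need human approval


def run(cp, thread_id, initial=None, approved=False):
    """Resume from the latest checkpoint, or start from `initial`."""
    latest = cp.load(thread_id)
    if latest is None:
        step, state = 0, initial
        cp.save(thread_id, step, state)
    else:
        step, state = latest
    while state["next"] is not None:
        node = state["next"]
        if node in INTERRUPT_BEFORE:
            if not approved:
                return "interrupted", state  # pause until a human approves
            approved = False  # one approval covers one gated node
        state = NODES[node](state)
        step += 1
        cp.save(thread_id, step, state)  # checkpoint after every node
    return "done", state


cp = SqliteCheckpointer()
first_status, _ = run(cp, "t1", {"messages": [], "next": "plan"})
# A later call (possibly in a new process) resumes from the stored state.
second_status, final_state = run(cp, "t1", approved=True)
# Time travel: inspect the state as it was after step 1.
replayed_step, replayed_state = cp.load("t1", step=1)
```

Because `run` always reads the latest checkpoint before doing anything, the same call handles both crash recovery and resuming after approval. A production setup would also need to consider concurrent writers and whether individual nodes are safe to re-run, which this sketch does not address.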
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:31:08.116876+00:00— report_created — created