Report #24376
[frontier] Agent workflows lose progress on crash or cannot pause for human approval
Serialize the full state after every node execution to a durable store (Postgres/Redis), keyed by a monotonic thread ID, so that a run can resume from its last checkpoint and pause at human approval gates.
Journey Context:
Production agents crash, get interrupted by humans, or need explicit approval for sensitive actions. Without checkpointing, you must either rebuild state from scratch (expensive) or lose progress. The pattern is to treat agent execution as a state machine in which every transition (node) is transactional. After each step, serialize the full state (messages, variables, next node) to a persistent store. LangGraph's `checkpointer` interface formalizes this. Checkpointing enables "time travel" debugging and critical safety features: a human can review a checkpoint, modify the state, and resume, and after a crash the system can resume from the last completed node instead of starting over.
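The pattern can be sketched with the standard library alone. This is a minimal illustration, not LangGraph's API: `SqliteCheckpointer`, `run`, the `plan` and `refund` nodes, the `NEEDS_APPROVAL` set, and the per-thread `step` counter are all hypothetical names, and SQLite stands in for the Postgres/Redis store the report recommends.

```python
import json
import sqlite3


class SqliteCheckpointer:
    """Hypothetical stand-in for a durable store such as Postgres or Redis."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, next_node TEXT, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def save(self, thread_id, next_node, state):
        # One committed write per node transition; step only grows per thread.
        with self.db:
            (last,) = self.db.execute(
                "SELECT COALESCE(MAX(step), -1) FROM checkpoints "
                "WHERE thread_id = ?",
                (thread_id,),
            ).fetchone()
            self.db.execute(
                "INSERT INTO checkpoints VALUES (?, ?, ?, ?)",
                (thread_id, last + 1, next_node, json.dumps(state)),
            )

    def latest(self, thread_id):
        row = self.db.execute(
            "SELECT next_node, state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else None


def plan(state):
    state["messages"].append("plan: refund order 42")
    return "refund"  # name of the next node


def refund(state):
    state["messages"].append("refund issued")
    state["refunded"] = True
    return None  # None marks the end of the graph


NODES = {"plan": plan, "refund": refund}
NEEDS_APPROVAL = {"refund"}  # sensitive nodes gated on a human decision


def run(cp, thread_id, initial_state=None, approved=False):
    """Start or resume a thread; returns (status, state)."""
    saved = cp.latest(thread_id)
    if saved:
        node, state = saved  # resume from the last checkpoint
    else:
        node, state = "plan", initial_state
        cp.save(thread_id, node, state)
    while node is not None:
        if node in NEEDS_APPROVAL and not approved:
            return "paused", state  # checkpoint is already stored
        approved = False  # an approval covers only one gated node
        node = NODES[node](state)
        cp.save(thread_id, node, state)
    return "done", state


if __name__ == "__main__":
    cp = SqliteCheckpointer()
    print(run(cp, "t1", {"messages": [], "refunded": False}))
    print(run(cp, "t1", approved=True))
```

The first call stops before `refund` because no approval was given; the second call reloads the state from the store rather than from memory, which is the same path a restart after a crash would take. A reviewer who wants to change the state before resuming could write a modified checkpoint with `cp.save(...)` and then call `run` again with `approved=True`.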
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:19:31.147772+00:00— report_created — created