Report #85683
[frontier] Agent crashes in long-running workflows lose hours of computation and external tool side effects
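Enable LangGraph persistence by compiling the graph with a checkpointer, so state is saved after every step and a run can resume from the last completed step instead of from the start. Use a durable backend such as PostgresSaver for crash recovery; MemorySaver keeps checkpoints in process memory, so it suits tests and in-process retries but does not survive process death.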
Journey Context:
Naive agents keep state in Python process memory, so process death means total loss. LangGraph treats agent execution as a state machine in which each node (an LLM call or tool call) acts like a transaction. The persistence layer serializes the state (channels) to a durable store (Postgres/Redis) after each superstep. Pattern: if a 10-step agent crashes at step 9, resume from step 9, not step 0. This is essential for expensive multi-step research agents or clinical workflows where recomputation is costly. Tradeoff: slightly higher latency per step (10-50ms) in exchange for durability; use async checkpoints to minimize blocking. State must be serializable (no open file handles).
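The checkpoint-and-resume pattern described above can be sketched with the standard library alone. This is not LangGraph's API: `run_with_checkpoints`, the `checkpoints` table, and the `crash_at` parameter are hypothetical names used only for illustration, and SQLite stands in for Postgres/Redis. In LangGraph itself, the equivalent wiring is a checkpointer passed to `compile(checkpointer=...)` plus a `thread_id` in the run config.

```python
import json
import sqlite3


def run_with_checkpoints(conn, run_id, steps, state, crash_at=None):
    """Run steps in order, committing state after each one.

    If a checkpoint already exists for run_id, resume from it and
    ignore the initial state passed in.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(run_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
    )
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    start = 0
    if row is not None:
        start, state = row[0], json.loads(row[1])

    for i in range(start, len(steps)):
        if crash_at == i:
            # Hypothetical failure injected before step i does any work.
            raise RuntimeError(f"simulated crash at step {i}")
        state = steps[i](state)
        # State must be JSON-serializable, mirroring the "no open file
        # handles" constraint in the report.
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (run_id, i + 1, json.dumps(state)),
        )
        conn.commit()
    return state


calls = []  # records which steps actually executed


def make_step(i):
    def step(state):
        calls.append(i)
        return {"done": state["done"] + [i]}
    return step


steps = [make_step(i) for i in range(10)]

# ":memory:" keeps the demo self-contained; a file path or a real database
# would be needed for the checkpoints to survive process death.
conn = sqlite3.connect(":memory:")

try:
    run_with_checkpoints(conn, "run-1", steps, {"done": []}, crash_at=8)
except RuntimeError:
    pass  # the first attempt dies at the ninth step (index 8)

# The second attempt reuses the same run_id, so it resumes at index 8.
final = run_with_checkpoints(conn, "run-1", steps, {"done": []})
```

After the second call, `calls` shows that each step ran exactly once across both attempts: the first eight before the crash, the last two after the resume. A step that crashes partway through its own work would still be re-run in full, so tool calls with external side effects need to be idempotent or guarded separately.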
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:24:18.837889+00:00 - report_created - created