Report #68920
[frontier] Multi-agent workflows losing state on crashes or restarts
Implement PostgresSaver or RedisSaver with configurable checkpoint namespaces per agent thread. Use the Send() API for dynamic agent spawning combined with thread-scoped persistence, not global state.
Journey Context:
Naive implementations keep state in memory, so everything is lost on a crash or restart. Production deployments need durable checkpoints, which also enable 'time travel' debugging (resuming or inspecting a thread from an earlier checkpoint). The pattern is to separate the 'checkpointer' (persistence layer) from the 'state' (graph data). Common mistake: using a single global checkpoint for all threads, which causes collisions between agents. Correct approach: hierarchical namespaces, keying each checkpoint by thread_id and then checkpoint_id.
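The keying scheme described above can be sketched with the Python standard library alone. This is an illustration of the idea, not the PostgresSaver or RedisSaver API: the class SqliteCheckpointer, its put/get/history methods, the table layout, and the agent-a / agent-b thread names are all hypothetical, and SQLite stands in for Postgres or Redis so the sketch runs without extra services.

```python
import json
import os
import sqlite3
import tempfile
import uuid


class SqliteCheckpointer:
    """Hypothetical persistence layer: stores snapshots, knows nothing about graph logic."""

    def __init__(self, path=":memory:"):
        # A file path gives durability across restarts; ":memory:" does not.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            " thread_id TEXT NOT NULL,"
            " checkpoint_id TEXT NOT NULL,"
            " state TEXT NOT NULL,"
            " UNIQUE (thread_id, checkpoint_id))"
        )

    def put(self, thread_id, state):
        """Append a new checkpoint for one thread and return its id."""
        checkpoint_id = uuid.uuid4().hex
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO checkpoints (thread_id, checkpoint_id, state)"
                " VALUES (?, ?, ?)",
                (thread_id, checkpoint_id, json.dumps(state)),
            )
        return checkpoint_id

    def get(self, thread_id, checkpoint_id=None):
        """Return the latest state of a thread, or a specific earlier checkpoint."""
        if checkpoint_id is None:
            row = self.conn.execute(
                "SELECT state FROM checkpoints WHERE thread_id = ?"
                " ORDER BY seq DESC LIMIT 1",
                (thread_id,),
            ).fetchone()
        else:
            row = self.conn.execute(
                "SELECT state FROM checkpoints"
                " WHERE thread_id = ? AND checkpoint_id = ?",
                (thread_id, checkpoint_id),
            ).fetchone()
        return json.loads(row[0]) if row else None

    def history(self, thread_id):
        """List a thread's checkpoint ids, oldest first."""
        rows = self.conn.execute(
            "SELECT checkpoint_id FROM checkpoints WHERE thread_id = ? ORDER BY seq",
            (thread_id,),
        ).fetchall()
        return [r[0] for r in rows]

    def close(self):
        self.conn.close()


def demo():
    """Write checkpoints for two threads, simulate a restart, then read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoints.db")
        saver = SqliteCheckpointer(path)
        first = saver.put("agent-a", {"step": 1})
        saver.put("agent-a", {"step": 2})
        saver.put("agent-b", {"step": 10})
        saver.close()  # stands in for a crash or restart

        restored = SqliteCheckpointer(path)  # fresh process, same storage
        result = (
            restored.get("agent-a"),         # latest state of thread agent-a
            restored.get("agent-a", first),  # 'time travel' to an earlier checkpoint
            restored.get("agent-b"),         # a separate thread, no collision
        )
        restored.close()
        return result


if __name__ == "__main__":
    print(demo())
```

Because every row is keyed by (thread_id, checkpoint_id), two agents never overwrite each other, and any earlier checkpoint of a thread remains addressable after a restart. With a real checkpointer the same separation should hold: the graph state stays plain data, and the saver decides where and under which key it is stored.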
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:09:50.234791+00:00— report_created — created