Report #28979
[frontier] Long-running agent workflows lose state on crashes and cannot be resumed or debugged
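Implement persistent checkpointing with a LangGraph checkpointer or similar. Note that `MemorySaver` keeps checkpoints in process memory only, so it suits tests and prototypes but does not survive a crash; for durability, use a saver backed by Redis/Postgres and persist state after every node execution. This enables human-in-the-loop interrupts and crash recovery via `resume` from the last checkpoint.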
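A minimal sketch of the checkpoint-after-every-node idea, written with Python's standard library only rather than LangGraph's API. SQLite stands in for Redis/Postgres, and every name here (`open_store`, `save_checkpoint`, `load_checkpoint`, `run`, the three demo nodes, and the `crash` switch) is hypothetical and exists only for illustration.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    # Assumption: SQLite as the checkpoint store; pass a file path for real durability.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, next_node INTEGER, state TEXT)"
    )
    return conn

def save_checkpoint(conn, thread_id, next_node, state):
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, next_node, json.dumps(state)),
        )

def load_checkpoint(conn, thread_id):
    row = conn.execute(
        "SELECT next_node, state FROM checkpoints WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, None)

def run(conn, thread_id, nodes, initial_state=None):
    # Resume from the last committed node if a checkpoint exists for this thread.
    next_node, saved = load_checkpoint(conn, thread_id)
    state = saved if saved is not None else dict(initial_state or {})
    for i in range(next_node, len(nodes)):
        state = nodes[i](state)                          # may raise
        save_checkpoint(conn, thread_id, i + 1, state)   # commit after each node
    return state

# Hypothetical demo nodes; `crash` simulates a one-time failure in the second node.
calls = []
crash = {"armed": True}

def fetch(state):
    calls.append("fetch")
    return {**state, "data": [1, 2, 3]}

def summarize(state):
    calls.append("summarize")
    if crash["armed"]:
        raise RuntimeError("simulated crash")
    return {**state, "total": sum(state["data"])}

def report(state):
    calls.append("report")
    return {**state, "report": f"total={state['total']}"}

NODES = [fetch, summarize, report]
```

In this sketch, a first `run(conn, "t1", NODES, {})` raises inside `summarize` after `fetch` has been committed. Once `crash["armed"]` is set to `False`, calling `run(conn, "t1", NODES)` again starts at `summarize` and does not re-run `fetch`.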
Journey Context:
Standard stateless agents lose all progress on error. LangGraph (and similar frameworks like Temporal) treat agent workflows as state machines. Each node (tool call, LLM invocation) is a transaction: its result is committed as a checkpoint, and on failure the workflow replays from the last commit. This is essential for multi-step approval workflows (e.g., code review agents), where a human rejection should branch to an edit step rather than restart the whole run. Tradeoff: persistence adds latency to every step; use async checkpoints for non-critical paths.
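The approval-workflow case can be sketched as a small state machine that pauses at a review node and, on rejection, routes to an edit node instead of starting over. This is an illustration under stated assumptions, not LangGraph's or Temporal's API: the in-memory `STORE` dict stands in for a Redis/Postgres checkpoint store, and the node names, `run_until_pause`, and the `decision` values are all hypothetical.

```python
import json

STORE = {}  # thread_id -> JSON checkpoint; assumption: stand-in for Redis/Postgres

def draft(state):
    state["drafts"] = state.get("drafts", 0) + 1  # counts how often we start over
    state["patch"] = "v1"
    return "review"

def review(state):
    decision = state.pop("decision", None)
    if decision is None:
        return None  # no human input yet: pause here
    return "merge" if decision == "approve" else "edit"

def edit(state):
    state["revision"] = state.get("revision", 1) + 1
    state["patch"] = f"v{state['revision']}"
    return "review"

def merge(state):
    state["merged"] = state["patch"]
    return "END"

NODES = {"draft": draft, "review": review, "edit": edit, "merge": merge}

def run_until_pause(thread_id, decision=None):
    # Load the last checkpoint, or start a fresh thread at the draft node.
    if thread_id in STORE:
        cp = json.loads(STORE[thread_id])
    else:
        cp = {"node": "draft", "state": {}}
    node, state = cp["node"], cp["state"]
    if decision is not None:
        state["decision"] = decision  # human input supplied on resume
    while node != "END":
        nxt = NODES[node](state)
        if nxt is None:  # node asked to wait for a human
            break
        node = nxt
        # Checkpoint after every completed node transition.
        STORE[thread_id] = json.dumps({"node": node, "state": state})
    return node, state
```

A first call such as `run_until_pause("pr-1")` stops at `review`. Resuming with `decision="reject"` runs `edit` and returns to `review` with a revised patch, and a later `decision="approve"` proceeds to `merge`; the `draft` node runs only once across all three calls.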
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:01:54.894466+00:00— report_created — created