Report #54774
[frontier] Long-running agent workflows fail midway and lose all progress, requiring full restart from scratch
Implement persistent checkpointing with LangGraph's built-in persistence layer. Serialize the full graph state (channel values, memory, tool outputs, current node position), not just the messages, and key it by thread_id so that a run can resume from any checkpointed step.
Journey Context:
Teams initially tried to persist only the message history, but this loses the agent's internal state (e.g., which tools were already called, intermediate variables, current graph node). Full serialization of the State object, including channel values, is necessary. The tradeoff is storage size versus reliability. In production, use a Redis or Postgres checkpointer rather than the in-memory one, which loses its checkpoints when the process exits.
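The difference between persisting messages only and persisting the full state can be sketched in plain Python. This is an illustration of the idea, not LangGraph's API: the SqliteCheckpointer class, the three node functions, the run helper, and the fail_at parameter are all hypothetical names invented for the example, and SQLite stands in for a Redis or Postgres backend.

```python
import json
import sqlite3

class SqliteCheckpointer:
    """Hypothetical helper: one JSON snapshot per (thread_id, step)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def put(self, thread_id, step, state):
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, step, json.dumps(state)),
        )
        self.conn.commit()

    def latest(self, thread_id):
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None

# Hypothetical three-node workflow; each node returns updates to the state.
def plan(state):
    return {"messages": state["messages"] + ["plan ready"]}

def fetch(state):
    return {"tool_outputs": {"search": "raw results"}}

def summarize(state):
    text = state["tool_outputs"]["search"]
    return {"messages": state["messages"] + ["summary of " + text]}

NODES = [("plan", plan), ("fetch", fetch), ("summarize", summarize)]
ORDER = [name for name, _ in NODES]
CALLS = []  # records which nodes actually executed, for demonstration

def run(thread_id, saver, fail_at=None):
    # Resume from the latest snapshot if one exists, else start fresh.
    state = saver.latest(thread_id) or {
        "messages": [], "tool_outputs": {}, "next_node": "plan", "step": 0,
    }
    start = ORDER.index(state["next_node"]) if state["next_node"] else len(ORDER)
    for i in range(start, len(NODES)):
        name, fn = NODES[i]
        if name == fail_at:
            raise RuntimeError("simulated crash in " + name)
        CALLS.append(name)
        state.update(fn(state))
        # The node position is part of the checkpoint, not just the messages.
        state["next_node"] = ORDER[i + 1] if i + 1 < len(ORDER) else None
        state["step"] += 1
        saver.put(thread_id, state["step"], state)
    return state

saver = SqliteCheckpointer()
try:
    run("thread-1", saver, fail_at="summarize")
except RuntimeError:
    pass  # the first attempt dies before the last node
final = run("thread-1", saver)  # resumes at "summarize", skipping earlier nodes
```

Because the snapshot carries tool_outputs and next_node along with the messages, the second call runs only the summarize node. A snapshot holding messages alone would give the resumed run no record that fetch had already completed, and no search result to summarize.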
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:26:02.258635+00:00— report_created — created