Report #88336
[frontier] Agent crashes mid-workflow (after 5+ tool calls) lose all progress, requiring full restart and re-execution of expensive operations
Configure LangGraph checkpointing with PostgreSQL/Redis persistence to save agent state after every node execution, enabling resumable workflows and human-in-the-loop approval gates
Journey Context:
Developers often build stateless agents that fail catastrophically on memory errors, rate limits, or spot instance termination. The hard-won production pattern is to treat agent execution as a durable workflow: after every tool call or LLM generation, serialize the entire state (messages, scratchpad, context variables) to a database. LangGraph's checkpointing supports this through its 'stateful graph' abstraction. On restart, the graph resumes from the last successful node rather than from the beginning. This enables 'sleeping' agents (paused for hours or days), human approval gates, and crash recovery. The anti-pattern is storing state only in memory or assuming that a 'stateless' retry is sufficient.
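The durable-workflow pattern described above can be sketched without any framework. The following is a minimal, library-agnostic illustration and not LangGraph's API: `SqliteCheckpointer`, `run_workflow`, the node functions, and the `job-1` thread ID are all hypothetical names, and SQLite from the standard library stands in for PostgreSQL or Redis. In LangGraph itself, the corresponding role is played by a checkpointer passed to `graph.compile(checkpointer=...)` together with a `thread_id` in the run config.

```python
import json
import sqlite3
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]
Node = Callable[[State], State]


class SqliteCheckpointer:
    """Hypothetical helper: keeps the latest state per thread_id."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT PRIMARY KEY, next_node INTEGER, state TEXT)"
        )

    def save(self, thread_id: str, next_node: int, state: State) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, next_node, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, thread_id: str) -> Optional[Tuple[int, State]]:
        row = self.conn.execute(
            "SELECT next_node, state FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else None


def run_workflow(
    nodes: List[Tuple[str, Node]],
    checkpointer: SqliteCheckpointer,
    thread_id: str,
    initial_state: State,
    approval_gates: frozenset = frozenset(),
    approved: bool = False,
) -> Dict[str, object]:
    # Resume from the last saved checkpoint if one exists.
    saved = checkpointer.load(thread_id)
    index, state = saved if saved else (0, dict(initial_state))
    while index < len(nodes):
        name, node = nodes[index]
        if name in approval_gates and not approved:
            # Human-in-the-loop gate: stop here; state is already persisted.
            return {"status": "paused", "waiting_on": name, "state": state}
        state = node(state)
        index += 1
        approved = False  # an approval covers a single gate only
        checkpointer.save(thread_id, index, state)  # checkpoint after every node
    return {"status": "done", "state": state}


# --- Demo: a crash in the second node, then a resume, then an approval ---
calls = {"fetch": 0, "summarize": 0}
crash_once = {"armed": True}


def fetch(state: State) -> State:
    calls["fetch"] += 1  # stands in for an expensive tool call
    return {**state, "docs": ["a", "b"]}


def summarize(state: State) -> State:
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated rate limit")
    calls["summarize"] += 1
    return {**state, "summary": f"{len(state['docs'])} docs"}


def send(state: State) -> State:
    return {**state, "sent": True}


nodes = [("fetch", fetch), ("summarize", summarize), ("send", send)]
cp = SqliteCheckpointer()
gates = frozenset({"send"})

try:
    run_workflow(nodes, cp, "job-1", {"query": "q"}, gates)
except RuntimeError:
    pass  # the process "crashes" after fetch has been checkpointed

paused = run_workflow(nodes, cp, "job-1", {"query": "q"}, gates)
final = run_workflow(nodes, cp, "job-1", {"query": "q"}, gates, approved=True)
```

In this sketch the workflow is started three times, yet `fetch` executes only once: the second call resumes at `summarize` and stops at the `send` gate, and the third call completes after approval. With an in-memory database the checkpoints only survive within one process; a file path or a real database server would be needed for recovery across restarts.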
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:51:15.767371+00:00— report_created — created