Report #23061
[frontier] Agent loops lose all progress on crash or require re-running expensive LLM calls after interruptions
Implement deterministic checkpointing after every node execution in agent graphs: serialize the channel values and the next-node pointer to persistent storage (Postgres, Redis, or SQLite) so a run can be replayed exactly from any step without repeating earlier LLM calls.
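A minimal sketch of the idea, assuming a linear three-node graph and SQLite as the store. It uses only the Python standard library and is not LangGraph's API; the node functions, the `checkpoints` table layout, and the `crash_before` switch are all hypothetical names chosen for illustration.

```python
import json
import sqlite3

CALLS = []  # records which (simulated) LLM calls actually ran

# Hypothetical nodes; in a real agent each would wrap an expensive LLM call.
def plan(state):
    CALLS.append("plan")
    return {**state, "plan": f"outline for {state['task']}"}

def draft(state):
    CALLS.append("draft")
    return {**state, "draft": f"text based on {state['plan']}"}

def review(state):
    CALLS.append("review")
    return {**state, "approved": True}

NODES = {"plan": plan, "draft": draft, "review": review}
EDGES = {"plan": "draft", "draft": "review", "review": None}  # None = finished

def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT, step INTEGER, next_node TEXT, state TEXT, "
        "PRIMARY KEY (thread_id, step))"
    )
    return conn

def save_checkpoint(conn, thread_id, step, next_node, state):
    with conn:  # commit the checkpoint as one transaction
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
            (thread_id, step, next_node, json.dumps(state, sort_keys=True)),
        )

def load_checkpoint(conn, thread_id, step=None):
    """Return (step, next_node, state) for a given step, or the latest one."""
    if step is None:
        row = conn.execute(
            "SELECT step, next_node, state FROM checkpoints "
            "WHERE thread_id = ? ORDER BY step DESC LIMIT 1", (thread_id,)
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT step, next_node, state FROM checkpoints "
            "WHERE thread_id = ? AND step = ?", (thread_id, step)
        ).fetchone()
    return None if row is None else (row[0], row[1], json.loads(row[2]))

def run(conn, thread_id, initial_state, entry="plan",
        crash_before=None, from_step=None):
    checkpoint = load_checkpoint(conn, thread_id, from_step)
    if checkpoint is None:  # first run for this thread
        step, node, state = 0, entry, initial_state
        save_checkpoint(conn, thread_id, step, node, state)
    else:  # resume: nodes that already ran are not invoked again
        step, node, state = checkpoint
    while node is not None:
        if node == crash_before:
            raise RuntimeError(f"simulated crash before {node}")
        state = NODES[node](state)
        step, node = step + 1, EDGES[node]
        save_checkpoint(conn, thread_id, step, node, state)
    return state

# Demo: crash after "plan", then resume from the stored checkpoint.
conn = open_store()
try:
    run(conn, "thread-1", {"task": "report"}, crash_before="draft")
except RuntimeError:
    pass
calls_before_resume = list(CALLS)
final = run(conn, "thread-1", {"task": "report"})
```

After the resume, `CALLS` shows `plan` only once: the second `run` starts at `draft` because the stored next-node pointer says so. Passing `from_step` replays from an earlier checkpoint instead of the latest one. The sketch keeps state as JSON, so it assumes node outputs are JSON-serializable.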
Journey Context:
Early agents were stateless or kept state in simple in-memory dicts, so a crash lost all progress. Production use requires durability and human-in-the-loop pauses. LangGraph's persistence layer saves a checkpoint after every superstep (one round of node execution, which may run several nodes in parallel). A run can therefore be interrupted, reviewed by a human, and resumed at the next node without repeating LLM calls that have already completed. The pattern is: graph state + interrupts + resume.
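The state + interrupts + resume pattern can be sketched as follows. This is an illustration in plain Python under stated assumptions, not LangGraph code: `PausableGraph`, its methods, and the two nodes are hypothetical, and the in-memory `store` dict stands in for a durable backend.

```python
import json

class PausableGraph:
    """Toy linear graph that pauses before selected nodes and can resume."""

    def __init__(self, nodes, edges, interrupt_before=()):
        self.nodes = nodes
        self.edges = edges
        self.interrupt_before = set(interrupt_before)
        self.store = {}  # thread_id -> JSON checkpoint (stand-in for a database)

    def _save(self, thread_id, next_node, state):
        self.store[thread_id] = json.dumps({"next": next_node, "state": state})

    def get_state(self, thread_id):
        return json.loads(self.store[thread_id])

    def update_state(self, thread_id, patch):
        # A human reviewer edits the saved state while the run is paused.
        checkpoint = self.get_state(thread_id)
        checkpoint["state"].update(patch)
        self._save(thread_id, checkpoint["next"], checkpoint["state"])

    def invoke(self, thread_id, initial_state=None, entry=None):
        resuming = initial_state is None
        if not resuming:
            self._save(thread_id, entry, initial_state)
        checkpoint = self.get_state(thread_id)
        node, state = checkpoint["next"], checkpoint["state"]
        while node is not None:
            if node in self.interrupt_before and not resuming:
                return {"status": "interrupted", "next": node, "state": state}
            resuming = False  # only the first node after a resume skips the pause
            state = self.nodes[node](state)
            node = self.edges[node]
            self._save(thread_id, node, state)  # checkpoint after every node
        return {"status": "done", "next": None, "state": state}

llm_calls = []  # records which (simulated) LLM calls actually ran

def propose(state):
    llm_calls.append("propose")
    return {**state, "action": "delete 3 stale branches"}

def execute(state):
    llm_calls.append("execute")
    return {**state, "result": f"ran: {state['action']}"}

graph = PausableGraph(
    nodes={"propose": propose, "execute": execute},
    edges={"propose": "execute", "execute": None},
    interrupt_before=["execute"],
)

first = graph.invoke("thread-1", {"request": "clean up repo"}, entry="propose")
graph.update_state("thread-1", {"action": "delete 1 stale branch"})  # human edit
second = graph.invoke("thread-1")  # resumes at "execute"; "propose" is not re-run
```

The first `invoke` stops before `execute` and returns the pending state for review. The second call passes no initial state, so it picks up the saved checkpoint, including the reviewer's edit, and runs only the remaining node.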
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:07:07.392226+00:00— report_created — created