Report #3308
[architecture] Agent loses context, can't resume after crash, or human approval breaks the flow
Persist agent state as a checkpointed graph keyed by thread_id, not as a chat-message buffer. Capture the full state snapshot after every super-step so you can resume from crashes, replay executions, fork state, and implement human-in-the-loop approvals.
Journey Context:
LangGraph's persistence model treats a run as a graph in which each super-step is checkpointed into a StateSnapshot tied to a thread. Most teams re-implement half of this with message history and manual retry logic, which loses deterministic replay and makes crashes unrecoverable. The tradeoff is that checkpointed graphs add a persistence layer and require designing state channels and reducers up front; in return, fault tolerance, observability, and human-in-the-loop approvals come from the same mechanism rather than from separate custom code. Chat history alone is insufficient for any agent that runs more than a few steps or needs to be reliable.
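The checkpoint-per-step idea can be sketched with the standard library alone. This is an illustration of the pattern, not LangGraph's API: `Snapshot`, `CheckpointStore`, `run`, and the `approve_before` / `approved` parameters are hypothetical names, the store is in-memory, and the "graph" is simplified to a fixed sequence of nodes. It assumes a failed node writes no checkpoint, so a rerun on the same thread_id picks up at the node that failed.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

State = Dict[str, object]


@dataclass
class Snapshot:
    step: int                 # number of nodes completed so far
    values: State             # full state after that step
    next_node: Optional[str]  # node to run next; None when finished


class CheckpointStore:
    """In-memory stand-in; a real store would write to a database."""

    def __init__(self) -> None:
        self._threads: Dict[str, List[Snapshot]] = {}

    def put(self, thread_id: str, snap: Snapshot) -> None:
        self._threads.setdefault(thread_id, []).append(copy.deepcopy(snap))

    def latest(self, thread_id: str) -> Optional[Snapshot]:
        history = self._threads.get(thread_id)
        return copy.deepcopy(history[-1]) if history else None

    def history(self, thread_id: str) -> List[Snapshot]:
        return copy.deepcopy(self._threads.get(thread_id, []))


def run(nodes: Dict[str, Callable[[State], State]], order: Sequence[str],
        store: CheckpointStore, thread_id: str,
        initial: Optional[State] = None,
        approve_before: Sequence[str] = (), approved: bool = False) -> Snapshot:
    """Run nodes in order, checkpointing the full state after each one."""
    snap = store.latest(thread_id)
    if snap is None:  # new thread: record the starting state
        snap = Snapshot(0, dict(initial or {}), order[0])
        store.put(thread_id, snap)
    while snap.next_node is not None:
        name = snap.next_node
        if name in approve_before and not approved:
            return snap  # pause for a human; state is already persisted
        approved = False  # one approval covers one gate
        values = nodes[name](copy.deepcopy(snap.values))  # may raise
        idx = order.index(name)
        nxt = order[idx + 1] if idx + 1 < len(order) else None
        snap = Snapshot(snap.step + 1, values, nxt)
        store.put(thread_id, snap)
    return snap


calls = {"plan": 0, "act": 0}


def plan(state: State) -> State:
    calls["plan"] += 1
    state["plan"] = "send email"
    return state


def act(state: State) -> State:
    calls["act"] += 1
    if calls["act"] == 1:
        raise RuntimeError("simulated crash")
    state["result"] = "sent"
    return state


def report(state: State) -> State:
    state["summary"] = f"{state['plan']}: {state['result']}"
    return state


nodes = {"plan": plan, "act": act, "report": report}
order = ["plan", "act", "report"]
store = CheckpointStore()

# 1) Runs "plan", then pauses before "act" to wait for approval.
paused = run(nodes, order, store, "thread-1", {"task": "notify"},
             approve_before=("act",))

# 2) Approved, but "act" crashes; the last good checkpoint is untouched.
crashed = False
try:
    run(nodes, order, store, "thread-1", approve_before=("act",), approved=True)
except RuntimeError:
    crashed = True

# 3) Same thread_id: resumes at "act" without re-running "plan".
final = run(nodes, order, store, "thread-1", approve_before=("act",), approved=True)
```

Because every snapshot is kept, `store.history("thread-1")` can also be used to inspect earlier steps or to start a new thread from a copy of one of them, which is the same idea behind replay and forking. A message buffer alone records what was said but not which node runs next, which is why it cannot support this kind of resume.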
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:29:33.726698+00:00— report_created — created