Report #27572
[frontier] How do I resume an agent workflow exactly where it left off after a crash, without replaying the entire conversation?
Use deterministic checkpoint IDs derived from an input hash plus the step count; persist state after each tool execution to a durable store (Postgres/Redis) keyed by that ID.
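A minimal sketch of the ID derivation, assuming the run inputs are JSON-serializable. The function name `checkpoint_id`, the choice of SHA-256, and the 32-character truncation are illustrative assumptions, not taken from any particular framework.

```python
import hashlib
import json


def checkpoint_id(inputs: dict, step: int) -> str:
    """Derive a stable checkpoint ID from the run inputs and the step count."""
    # Canonical JSON (sorted keys, fixed separators) so that equal inputs
    # always serialize to the same bytes, regardless of dict ordering.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{canonical}|{step}".encode("utf-8")).hexdigest()
    # Truncation is an arbitrary choice for shorter keys (assumption).
    return digest[:32]
```

Because the ID depends only on the inputs and the step, a restarted process can recompute the same key without consulting any in-memory state.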
Journey Context:
Agents running long tasks (hours) crash or get preempted. Naive approaches either restart from scratch or replay the full message history, and both are expensive. Frameworks such as LangGraph, and durable-execution engines such as Temporal, address this with checkpointing: after every tool execution or LLM turn, persist the state (messages, scratchpad) to durable storage under a deterministic ID. The ID is typically hash(thread_id + step_number) or a UUIDv5 built from the same fields. On restart, load the latest checkpoint for the thread and continue from there. The same mechanism provides fault tolerance and enables human-in-the-loop flows (pause for approval, then resume). Key insight: checkpoint at tool boundaries, not on every token, because tools are the side-effect boundaries where consistency matters. Common error: relying on in-memory state, or trying to "rewind" an LLM. Rewinding is impossible; you must persist the full message state.
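The save-and-resume loop described above can be sketched as follows. This is an illustration under stated assumptions, not how LangGraph or Temporal implement it: SQLite from the Python standard library stands in for Postgres/Redis, and the names `open_store`, `save`, `load_latest`, `run`, the `crash_after` parameter, and the `checkpoints.example` namespace are all hypothetical.

```python
import json
import sqlite3
import uuid

# Hypothetical fixed namespace for this sketch; any constant UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "checkpoints.example")


def make_id(thread_id: str, step: int) -> str:
    """Deterministic UUIDv5 from thread ID and step number."""
    return str(uuid.uuid5(NAMESPACE, f"{thread_id}:{step}"))


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint table (SQLite stands in for a durable store)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "id TEXT PRIMARY KEY, thread_id TEXT NOT NULL, "
        "step INTEGER NOT NULL, state TEXT NOT NULL)"
    )
    return conn


def save(conn: sqlite3.Connection, thread_id: str, step: int, state: dict) -> None:
    """Persist the full state after `step` completed tool calls."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
            (make_id(thread_id, step), thread_id, step, json.dumps(state)),
        )


def load_latest(conn: sqlite3.Connection, thread_id: str):
    """Return (completed_steps, state), or a fresh state if none exists."""
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE thread_id = ? "
        "ORDER BY step DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    if row is None:
        return 0, {"messages": [], "scratchpad": {}}
    return row[0], json.loads(row[1])


def run(conn, thread_id, tools, crash_after=None):
    """Run tools in order, resuming after the last checkpointed step."""
    step, state = load_latest(conn, thread_id)
    for i in range(step, len(tools)):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated crash")  # test hook only
        result = tools[i](state)
        state["messages"].append({"role": "tool", "content": result})
        save(conn, thread_id, i + 1, state)  # checkpoint at the tool boundary
    return state
```

Calling `run` again with the same `thread_id` after a crash skips every tool whose checkpoint was already written. A crash between a tool finishing and its `save` committing would re-execute that one tool on resume, so tools with external side effects should be safe to repeat.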
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:40:32.647994+00:00— report_created — created