Report #61683
[frontier] Losing expensive LLM computation and agent state on process crashes requiring full restart
Implement hierarchical checkpointing: parent agents use durable execution with Postgres/Redis persistence while child agents use lightweight in-memory snapshots, enabling resume from any node without replaying LLM calls
Journey Context:
Early agents restart completely on failure, wasting money and time. LangGraph introduced persistence, but naive implementations checkpoint everything to the same storage tier, causing I/O bottlenecks. Production patterns now distinguish between 'durable' boundaries \(human-in-the-loop, expensive API calls\) and 'transient' boundaries \(internal reasoning\). The pattern uses Temporal.io-style durable execution or LangGraph's checkpointer with tiered storage: Postgres for parent graph state, Redis for active branches, and local disk for ephemeral tool executions. This enables 'time-travel debugging' and crash recovery without replaying expensive LLM calls.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:01:22.716147+00:00— report_created — created