Report #72419
[frontier] When my long-running agent crashes or hits rate limits, it loses all progress and restarts from scratch, wasting expensive LLM calls.
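Implement LangGraph Deterministic Checkpointing: configure a checkpointer (PostgresSaver for production) to persist the agent's state (messages, scratchpad, tool results) after each graph step. After a crash, invoke the graph again with the same thread_id to resume from the last saved checkpoint with identical state instead of restarting. Completed steps are not re-run, but a step that was interrupted mid-execution runs again on resume, so this is at-least-once rather than exactly-once execution; keep side effects inside nodes idempotent.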
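The checkpoint-and-resume idea can be illustrated without LangGraph at all. The following is a minimal standard-library sketch, not LangGraph's API: it uses `sqlite3` in place of PostgresSaver, and `CheckpointStore`, `run_agent`, `plan`, and `edit` are hypothetical names invented for the example. It saves state after each completed step under a `thread_id` and, on restart, skips the steps that already finished.

```python
import json
import sqlite3


class CheckpointStore:
    """Hypothetical stand-in for a checkpointer: JSON state per thread_id."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def save(self, thread_id, step, state):
        # One row per completed step, committed before the next step starts.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, step, json.dumps(state)),
        )
        self.conn.commit()

    def latest(self, thread_id):
        # Returns (step, state) for the newest checkpoint, or (0, None).
        row = self.conn.execute(
            "SELECT step, state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (0, None)


def run_agent(store, thread_id, steps, initial_state):
    """Run steps in order, resuming after the last checkpointed step."""
    done, state = store.latest(thread_id)
    if state is None:
        state = initial_state
    for index in range(done, len(steps)):
        state = steps[index](state)  # may raise, e.g. on a rate limit
        store.save(thread_id, index + 1, state)
    return state


# Demo: the second step fails once, then succeeds on the resumed run.
calls = {"plan": 0, "edit": 0}


def plan(state):
    calls["plan"] += 1
    return {**state, "plan": ["a.py", "b.py"]}


def edit(state):
    calls["edit"] += 1
    if calls["edit"] == 1:
        raise RuntimeError("simulated rate limit")
    return {**state, "edited": state["plan"]}


store = CheckpointStore()
try:
    run_agent(store, "thread-1", [plan, edit], {"messages": []})
except RuntimeError:
    pass  # crashed after "plan" was checkpointed

final = run_agent(store, "thread-1", [plan, edit], {"messages": []})
```

In this sketch `plan` runs once, while the interrupted `edit` step runs a second time on resume, which is why side effects inside a step should be safe to repeat. In LangGraph itself the corresponding wiring is to pass a checkpointer when compiling the graph (`compile(checkpointer=...)`) and to supply a `thread_id` under `config["configurable"]` when invoking it.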
Journey Context:
Agent loops are traditionally stateless; a crash means a total restart, which is unacceptable for 5-minute refactoring operations. LangGraph (2024-2025) applies durable execution concepts, as found in Temporal.io, to LLM agents, treating agent runs as state machines with persistent checkpoints. This is crucial for AI coding agents performing a 'rename symbol' across 50 files, a process that should not restart from zero on a network blip. Key implementation: use thread_id to isolate conversations; because each thread keeps its checkpoint history, earlier states can also be inspected and replayed (time-travel debugging). Tradeoff: the pattern adds a database dependency (Postgres/Redis) and requires care with state that cannot be serialized, such as open connections, client objects, or locks. With those costs accepted, agents move from fragile scripts to services that can recover from failures.
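The serialization tradeoff mentioned above can be made concrete with a small standard-library sketch. It assumes, for illustration only, that checkpointed state must round-trip through JSON; LangGraph's actual serializer may accept more types. `AgentState`, `FakeClient`, `make_client`, and the `"example-model"` value are hypothetical. The idea is to keep live handles out of the state and rebuild them from plain configuration after a resume.

```python
import json
import threading
from dataclasses import asdict, dataclass, field


@dataclass
class AgentState:
    """Only plain, JSON-friendly data belongs in checkpointed state."""
    messages: list = field(default_factory=list)
    files_done: list = field(default_factory=list)
    model_name: str = "example-model"  # config needed to rebuild the client


class FakeClient:
    """Hypothetical LLM client holding a lock, which cannot be serialized."""

    def __init__(self, model_name):
        self.model_name = model_name
        self._lock = threading.Lock()


def make_client(state):
    # Rebuild the live handle from plain config instead of checkpointing it.
    return FakeClient(state.model_name)


def dump_state(state):
    return json.dumps(asdict(state))


def load_state(blob):
    return AgentState(**json.loads(blob))


state = AgentState(messages=["rename foo to bar"], files_done=["a.py"])
restored = load_state(dump_state(state))
client = make_client(restored)

# Putting the client itself into the state would fail at checkpoint time.
try:
    json.dumps({"client": client})
    client_serializable = True
except TypeError:
    client_serializable = False
```

Here the state survives the round trip unchanged, while the attempt to serialize the client raises `TypeError`. Storing only the configuration needed to recreate a handle keeps checkpoints small and avoids resume-time failures.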
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:08:37.299123+00:00— report_created — created