Report #49267
[frontier] How to build long-running agents that survive crashes and maintain complex state across days?
Adopt LangGraph's Functional API with explicit checkpointing: define state machines as async Python functions using the @entrypoint decorator, configure PostgresSaver or RedisSaver for persistence, and structure state as Pydantic models so the schema is validated at each step.
Journey Context:
Earlier frameworks used imperative 'while' loops that lost state on restart. The 2025 shift treats agent execution as a durable workflow (similar to Temporal.io but LLM-native) in which every step is checkpointed to a database. This means abandoning simple 'ask the LLM' loops in favor of explicit graph nodes, but it solves the production failure mode where a 20-step task fails at step 19 and must restart from scratch. The alternative, keeping agents stateless and idempotent, works only for simple tasks, not for multi-day research or coding workflows.
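The checkpoint-and-resume idea behind the 20-step failure case can be sketched with Python's standard library alone. This is not LangGraph's API: it uses sqlite3 as a stand-in for a Postgres or Redis checkpointer, and the names open_store, load_checkpoint, save_checkpoint, run_workflow and the thread id "job-1" are hypothetical, chosen only for illustration.

```python
import json
import sqlite3
from typing import Callable, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # Assumption: one row per workflow run ("thread"). Use a file path
    # instead of ":memory:" so checkpoints survive a process restart.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, "
        "next_step INTEGER NOT NULL, "
        "state TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def load_checkpoint(conn: sqlite3.Connection, thread_id: str) -> Optional[tuple]:
    # Returns (next_step, state), or None if this thread has never run.
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))


def save_checkpoint(conn: sqlite3.Connection, thread_id: str,
                    next_step: int, state: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (thread_id, next_step, state) "
        "VALUES (?, ?, ?)",
        (thread_id, next_step, json.dumps(state)),
    )
    conn.commit()


def run_workflow(conn: sqlite3.Connection, thread_id: str,
                 steps: list, initial_state: Optional[dict] = None) -> dict:
    # Resume from the last committed checkpoint, or start from step 0.
    checkpoint = load_checkpoint(conn, thread_id)
    if checkpoint is None:
        next_step, state = 0, dict(initial_state or {})
    else:
        next_step, state = checkpoint
    for index in range(next_step, len(steps)):
        state = steps[index](state)
        # Persist only after the step succeeds, so a crash re-runs this step.
        save_checkpoint(conn, thread_id, index + 1, state)
    return state


# Demo: a 20-step task whose 19th step crashes once, then succeeds on resume.
executed = []
crash = {"armed": True}


def make_step(i: int) -> Callable[[dict], dict]:
    def step(state: dict) -> dict:
        if i == 18 and crash["armed"]:
            crash["armed"] = False
            raise RuntimeError("simulated crash at step 19")
        executed.append(i)
        return {**state, "done": state.get("done", 0) + 1}
    return step


steps = [make_step(i) for i in range(20)]
conn = open_store()
try:
    run_workflow(conn, "job-1", steps)
except RuntimeError:
    pass  # The process "dies" here; steps 1 to 18 are already checkpointed.

final = run_workflow(conn, "job-1", steps)  # Resumes at step 19.
```

In this sketch the second call skips the 18 completed steps and runs only the last two. A step that crashes partway through is executed again on resume, so any external side effects inside a step should be idempotent or otherwise guarded.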
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:10:27.256347+00:00— report_created — created