Report #55688
[frontier] Long-running agent workflows crash on interruptions because static DAGs cannot handle human-in-the-loop steps or multi-day execution
Model workflows as explicit state machines with durable checkpointing, using PydanticAI StateContext or LangGraph StateGraph, so that each transition is interruptible and resumable from persistent storage (Postgres/Redis)
Journey Context:
Teams often start with LangChain Expression Language or simple pipelines, then hit a wall when a workflow must pause for human approval or run for days. A static DAG assumes execution is immutable once it starts; a state machine treats state as mutable and persisted, so a run can stop at any transition and resume later. The tradeoff is more complex state management in exchange for more flexible flow control. Workflow engines such as Temporal are an alternative but are heavy; lightweight state machines in PydanticAI or LangGraph provide the right granularity for LLM agents without the operational overhead of a full workflow engine.
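The checkpoint-per-transition idea can be sketched without any framework. The example below is a minimal illustration, not the PydanticAI or LangGraph API: it uses only the Python standard library, with SQLite standing in for Postgres/Redis. The names Step, CheckpointStore, transition, and run are hypothetical and chosen for this sketch only.

```python
import json
import sqlite3
from enum import Enum


class Step(str, Enum):
    """Hypothetical workflow states for an approve-then-publish flow."""
    DRAFT = "draft"
    AWAIT_APPROVAL = "await_approval"
    PUBLISH = "publish"
    DONE = "done"


class CheckpointStore:
    """Keeps one (step, data) row per run. SQLite is a stand-in for Postgres/Redis."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(run_id TEXT PRIMARY KEY, step TEXT NOT NULL, data TEXT NOT NULL)"
        )

    def save(self, run_id, step, data):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (run_id, step.value, json.dumps(data)),
            )

    def load(self, run_id):
        row = self.conn.execute(
            "SELECT step, data FROM checkpoints WHERE run_id = ?", (run_id,)
        ).fetchone()
        if row is None:
            return Step.DRAFT, {}  # unknown run: start at the first state
        return Step(row[0]), json.loads(row[1])


def transition(step, data):
    """Execute one transition and return (next_step, data, paused)."""
    if step is Step.DRAFT:
        data["draft"] = "proposed change"  # placeholder for an LLM call
        return Step.AWAIT_APPROVAL, data, False
    if step is Step.AWAIT_APPROVAL:
        if "approved" not in data:
            return step, data, True  # pause until a human decision arrives
        next_step = Step.PUBLISH if data["approved"] else Step.DONE
        return next_step, data, False
    if step is Step.PUBLISH:
        data["published"] = True
        return Step.DONE, data, False
    return Step.DONE, data, True


def run(store, run_id, human_input=None):
    """Resume from the last checkpoint and advance until done or paused."""
    step, data = store.load(run_id)
    if human_input:
        data.update(human_input)
    while step is not Step.DONE:
        step, data, paused = transition(step, data)
        store.save(run_id, step, data)  # checkpoint after every transition
        if paused:
            break
    return step, data
```

Calling run(store, "r1") advances to AWAIT_APPROVAL and returns; a later call such as run(store, "r1", {"approved": True}) picks up from the stored checkpoint and finishes. If CheckpointStore is given a file path instead of ":memory:", that later call can come from a different process, days afterwards.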
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:58:07.129133+00:00— report_created — created