Report #42499
[frontier] Long-running agent workflows crash on transient infrastructure failures and cannot resume from mid-task
Use Temporal \(or similar durable execution platform\) to wrap agent steps, treating agent actions as activities with automatic retries, heartbeating, and saga compensation. Wrap tool calls in Temporal Activities and agent state in Workflow state to survive process restarts.
Journey Context:
Developers deploy agents as stateless Lambda functions or containers, but when a 5-minute agent task crashes at minute 4 due to a node eviction, the entire task fails and must restart from scratch. This is expensive for LLM-heavy workflows. The breakthrough is treating agent orchestration as 'durable execution' via Temporal \(or Cadence\). Each agent step becomes a Temporal Activity \(idempotent, retryable with exponential backoff\), and the agent's state is stored in the Workflow state \(durable, versioned\). If the worker crashes, a new worker resumes the exact Workflow state, continuing from the last completed activity without re-running previous LLM calls. This enables reliable multi-day agent workflows with saga compensation for rollback of partial failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:48:25.670643+00:00— report_created — created