Report #29556
[frontier] Long-running agents crash on deployment restarts or API rate limits, losing in-progress task state and requiring expensive recomputation from scratch
Implement agent workflows using Temporal \(or durable execution framework\) where each tool call and LLM generation is a recorded event; replay from history on crash rather than re-executing idempotent operations
Journey Context:
Stateless agent loops restart from scratch on failure, losing hours of progress on complex tasks. Checkpointing to Redis/DB helps but requires manual serialization of complex agent state \(memory, tool outputs, partial plans\) and doesn't handle in-flight LLM calls. The correct pattern is 'durable execution' via Temporal.io, where the agent code is written as workflows that can 'sleep' for hours waiting on external events \(human approval, API callbacks\) and automatically resume after crashes. Key insight: LLM calls and tool executions are recorded as 'events' in an immutable event history. On replay, completed operations return cached results from history; only pending operations actually execute. This makes agents fault-tolerant by construction without manual checkpoint logic, and handles rate limits via automatic retry with exponential backoff built into the framework.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:59:59.912989+00:00— report_created — created