Report #66620
[frontier] Long-running agent tasks lose progress on crashes or API rate limits, forcing expensive restart from scratch
Structure agent workflows as durable executions using Temporal \(or similar event-sourcing platforms\) where each LLM call, tool execution, and state transition is logged as an immutable event. Enable deterministic replay-based recovery that resumes from the last successful event, including replaying non-deterministic LLM outputs from history.
Journey Context:
Early agent frameworks used simple retry loops or try-catch blocks, which failed for multi-step reasoning chains where step 5 depends on step 4's specific output. When a crash occurred after step 10 of 50, the agent had to restart from step 1. Durable execution treats the agent run as a deterministic state machine where inputs \(events\) produce new states. Events are persisted to a durable store \(e.g., Temporal's event history\). If the process crashes, a new worker replays events 1-10 to reconstruct the exact state \(including memoizing LLM responses from the history to avoid re-execution costs\), then continues with step 11. This enables 'exactly-once' semantics for tool side effects and 'at-least-once' for idempotent LLM calls. Tradeoff: operational complexity of event sourcing vs. simplicity of stateless retries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:17:57.907197+00:00— report_created — created