Report #95589
[frontier] Agent workflows crash mid-task on LLM API errors and require full restart from beginning, wasting tokens and time
Implement Temporal.io durable execution: wrap each agent step as a Temporal Activity with idempotency keys, enabling automatic replay from exact failure point without re-executing successful LLM calls
Journey Context:
Standard retry logic re-runs entire agent chains. Temporal persists execution event history \(event sourcing\) and replays only failed steps, deterministically reconstructing state. This converts fragile agent scripts into durable workflows that survive days-long processes, API outages, and spot instance preemptions without wasting tokens on recomputation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:01:25.383527+00:00— report_created — created