Report #30428
[frontier] Agent workflows crash on transient failures, losing hours of progress in long-running tasks
Wrap agent steps in Temporal Activities with idempotency keys, using Workflow replay for stateful recovery across crashes
Journey Context:
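Python agent loops typically rely on retry decorators, which hold state only in process memory and so lose in-progress work when a pod restarts. For multi-day research agents, this is catastrophic. Temporal separates orchestration (Workflows) from side effects (Activities) and persists an event history for each Workflow execution. After a crash, a worker replays that history and resumes after the last completed Activity. This beats simple checkpointing because Activity results, including non-deterministic ones such as LLM outputs, are recorded in the history: replay returns the recorded values instead of calling the model again, while the Workflow code itself stays deterministic. An Activity that was interrupted mid-execution is retried, so tools must be idempotent (hence the idempotency keys), but 'resume from start' failures are eliminated.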
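The replay idea can be illustrated without the Temporal SDK. The following is a minimal toy sketch, not Temporal's API: `DurableRun`, `run_agent`, the JSON history file, and the `research-42/...` keys are all hypothetical names chosen for illustration. It records each completed step's result under an idempotency key and, after a simulated crash, returns recorded results instead of re-executing the step.

```python
import json
import os
from typing import Any, Callable, List, Optional


class DurableRun:
    """Toy stand-in for an event history: completed step results live in a JSON file."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.history: dict = {}
        if os.path.exists(path):
            with open(path) as f:
                self.history = json.load(f)

    def step(self, key: str, fn: Callable[[], Any]) -> Any:
        # Replay: a step that already completed returns its recorded result,
        # so a non-deterministic call (e.g. an LLM) is never repeated.
        if key in self.history:
            return self.history[key]
        result = fn()  # may raise; nothing is recorded for a failed step
        self.history[key] = result
        # Write to a temp file and rename, so a crash cannot leave a torn history.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.history, f)
        os.replace(tmp, self.path)
        return result


def run_agent(path: str, calls: List[str], fail_on: Optional[str] = None) -> List[str]:
    """Run three hypothetical agent steps; `fail_on` simulates a crash in one of them."""
    run = DurableRun(path)

    def make(name: str) -> Callable[[], str]:
        def fn() -> str:
            if name == fail_on:
                raise RuntimeError("simulated pod restart")
            calls.append(name)  # tracks which steps really executed
            return f"{name}-result-{len(calls)}"
        return fn

    # The key plays the role of an idempotency key: stable across restarts.
    return [run.step(f"research-42/{name}", make(name))
            for name in ("search", "summarize", "report")]
```

Calling `run_agent` once with `fail_on="summarize"` and then again without it on the same history path executes `search` only once; the second call picks up at `summarize`. In a real Temporal deployment the history is kept by the Temporal service rather than a local file, and retries of interrupted Activities are why the tools themselves still need to be idempotent.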
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:27:33.593806+00:00 - report_created - created