Report #75031
[frontier] Celery/RQ task queues lose agent state on worker crashes and cannot resume multi-step agent workflows
Use Temporal (durable execution) for agent orchestration: write agent logic as async workflows that survive process crashes, with automatic retries, saga compensation for failed tool calls, and an event-sourced history for debugging.
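The idea behind "survive process crashes" with an event-sourced history can be sketched without any framework. The block below is a standard-library toy, not the Temporal SDK: `run_durably`, the JSON-lines journal file, and the three step functions are all hypothetical names chosen for illustration. It assumes each step returns a JSON-serializable result and that re-running a step that crashed before being journaled is acceptable.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple

# A step receives the results of all earlier steps and returns its own result.
Step = Callable[[List[str]], str]


def run_durably(steps: List[Step], history_path: str) -> List[str]:
    """Run steps in order, journaling each result to an append-only file.

    On restart, results already in the journal are replayed instead of
    re-executing the step, so work resumes at the first unfinished step.
    """
    history = []
    if os.path.exists(history_path):
        with open(history_path) as journal:
            history = [json.loads(line) for line in journal if line.strip()]
    results = [event["result"] for event in history]

    for index in range(len(history), len(steps)):
        result = steps[index](results)
        # Persist before moving on; a crash after this point will not redo the step.
        with open(history_path, "a") as journal:
            journal.write(json.dumps({"step": index, "result": result}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        results.append(result)
    return results


def demo() -> Tuple[List[str], List[str]]:
    """Simulate a worker crash during step 2, then a fresh worker resuming."""
    calls: List[str] = []
    crash = {"armed": True}

    def plan(prior: List[str]) -> str:
        calls.append("plan")
        return "plan: search then summarize"

    def search(prior: List[str]) -> str:
        calls.append("search")
        if crash["armed"]:
            crash["armed"] = False
            raise RuntimeError("simulated worker crash")
        return "3 documents"

    def summarize(prior: List[str]) -> str:
        calls.append("summarize")
        return f"summary of {prior[1]}"

    steps = [plan, search, summarize]
    path = os.path.join(tempfile.mkdtemp(), "history.jsonl")
    try:
        run_durably(steps, path)
    except RuntimeError:
        pass  # the first "worker" died mid-workflow
    results = run_durably(steps, path)  # a second "worker" picks up the journal
    return results, calls


if __name__ == "__main__":
    print(demo())
```

In this sketch `plan` runs only once even though the workflow is started twice, because its result is replayed from the journal. A durable execution engine applies the same principle with a replicated history store, timers, and retries rather than a local file.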
Journey Context:
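Standard job queues handle fire-and-forget work, not 'resume from step 7 of 12 after 2 hours'. Tradeoff: Temporal's operational complexity versus the reliability it buys. Common mistake: treating agent workflows as stateless tasks, or using simple retries without saga patterns, so that a failure midway leaves the side effects of earlier steps in place. Why: agent workflows are long-running, include non-deterministic steps (LLM and tool calls), and require human-in-the-loop pauses that job queues cannot model. In Temporal, those non-deterministic steps belong in activities, because workflow code itself must replay deterministically.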
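The difference between a plain retry and a saga can be shown in a few lines. This is a standard-library sketch under stated assumptions, not Temporal's API: `run_saga`, `SagaFailed`, and the booking-style tool calls are hypothetical. Each step pairs an action with a compensation, and when a step fails, the compensations for the steps that already completed run in reverse order.

```python
from typing import Callable, Dict, List, Tuple

# (name, action, compensation); all three are illustrative.
SagaStep = Tuple[str, Callable[[], None], Callable[[], None]]


class SagaFailed(Exception):
    """Raised after compensations have run; carries the execution log."""

    def __init__(self, failed_step: str, log: List[str]) -> None:
        super().__init__(f"saga failed at step {failed_step!r}")
        self.failed_step = failed_step
        self.log = log


def run_saga(steps: List[SagaStep]) -> List[str]:
    """Run actions in order; on failure, undo completed steps in reverse."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            raise SagaFailed(name, log) from exc
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return log


def demo() -> Tuple[Dict[str, bool], List[str]]:
    """Two tool calls succeed, the third fails, and the first two are undone."""
    state = {"seat": False, "hotel": False}

    def reserve_seat() -> None:
        state["seat"] = True

    def release_seat() -> None:
        state["seat"] = False

    def hold_hotel() -> None:
        state["hotel"] = True

    def release_hotel() -> None:
        state["hotel"] = False

    def charge_card() -> None:
        raise RuntimeError("payment tool call failed")

    def refund_card() -> None:
        pass  # never reached here: the charge did not complete

    steps = [
        ("reserve_seat", reserve_seat, release_seat),
        ("hold_hotel", hold_hotel, release_hotel),
        ("charge_card", charge_card, refund_card),
    ]
    try:
        return state, run_saga(steps)
    except SagaFailed as failure:
        return state, failure.log


if __name__ == "__main__":
    print(demo())
```

Retrying `charge_card` alone would leave the seat and hotel held indefinitely if the charge never succeeds. The compensation pass is what returns the system to a consistent state.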
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:32:18.051376+00:00— report_created — created