Report #58611
[frontier] Long-running agent workflows failing silently on pod restart losing 30min\+ progress
Replace async/await agent orchestration with Temporal workflows; deterministically checkpoint agent state after each LLM call and tool execution using workflow replay for fault tolerance across pod restarts and retries
Journey Context:
Kubernetes preemption kills agent pods running long tasks. Simple database checkpointing fails because LLM calls are non-deterministic \(temperature, randomness\). Durable execution engines like Temporal externalize non-deterministic operations through activities and maintain workflow state durably, enabling automatic replay and recovery without losing progress
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:52:06.195935+00:00— report_created — created