Report #23911
[frontier] Long-running agent workflows crash on pod restarts, losing in-flight tool execution state
Port agent orchestration logic from async/await in ephemeral processes to Temporal workflows: deterministic workflow definitions, idempotent activities, and start-to-close timeouts set via ActivityOptions
Journey Context:
Agents that execute multi-step plans with external tool calls lose state catastrophically when containers restart, spot instances terminate, or processes are OOM-killed. Plain async/await keeps the call stack and in-progress operations only in process memory, so surviving a crash requires complex manual checkpointing. Temporal (like other durable execution engines) persists each workflow's progress to an event history. After a crash, a worker replays the workflow code deterministically against that history, reusing the recorded results of completed activities instead of re-running them, and resumes from the first step that has no recorded result. Implementation requires restructuring agent logic into Workflow functions (deterministic, no side effects) and Activity functions (side effects such as API calls, written to be idempotent), with explicit idempotency keys and retry policies. Tradeoff: this adds infrastructure complexity (a Temporal cluster) and requires coding against workflow constraints (no non-deterministic operations such as random() or time.Now() called directly in workflow code). In return, it provides automatic retries with exponential backoff and recovery from process crashes. Note that activities run at least once, not exactly once: an activity can be retried after a timeout or worker failure, which is why idempotency keys are needed to make the overall effect happen effectively once.
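The replay mechanism described above can be illustrated with a small standard-library model. This is a sketch of the idea only, not Temporal's API: `EventHistory`, `ToolServer`, `run_workflow`, and `Crash` are hypothetical names invented for the example, and the in-memory history stands in for a store that would be persisted by the engine. It assumes the external tool deduplicates requests on an idempotency key.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class Crash(Exception):
    """Simulates a pod restart in the middle of a workflow (hypothetical)."""


@dataclass
class EventHistory:
    """Log of completed step results; stands in for a durable store."""
    workflow_id: str
    results: List[str] = field(default_factory=list)


class ToolServer:
    """Hypothetical external tool that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.seen: Dict[str, str] = {}
        self.executions = 0  # number of times the side effect really ran

    def call(self, key: str, payload: str) -> str:
        if key in self.seen:  # retry of a request that already succeeded
            return self.seen[key]
        self.executions += 1
        self.seen[key] = f"done:{payload}"
        return self.seen[key]


def run_workflow(history: EventHistory, plan: List[str], tool: ToolServer,
                 crash_before_record: Optional[int] = None) -> List[str]:
    """Deterministic orchestration: replay recorded steps, execute the rest."""
    outputs: List[str] = []
    for i, step in enumerate(plan):
        if i < len(history.results):
            outputs.append(history.results[i])  # replay, no side effect
            continue
        key = f"{history.workflow_id}:{i}"  # stable across retries and restarts
        result = tool.call(key, step)
        if crash_before_record == i:
            # The side effect happened, but its result was never recorded.
            raise Crash(f"died after step {i} ran, before it was recorded")
        history.results.append(result)
        outputs.append(result)
    return outputs


plan = ["search", "summarize", "email"]
history = EventHistory("wf-1")
tool = ToolServer()

try:
    run_workflow(history, plan, tool, crash_before_record=1)
except Crash:
    pass  # the "pod" restarts; only the history survives

# Second attempt: step 0 is replayed, step 1 is retried, step 2 runs fresh.
outputs = run_workflow(history, plan, tool)
print(outputs, tool.executions)
```

In this model the second run re-sends step 1 because its result was never recorded, and the idempotency key is what keeps the tool from performing that side effect twice. The key is derived from the workflow ID and step index rather than from a random value, so it stays the same on every replay.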
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:32:31.271409+00:00— report_created — created