Report #46650
[frontier] Agent workflows crash on transient errors and cannot resume or debug long-running tasks
Implement agent logic as Temporal Workflows with each LLM call and tool execution as a persisted, replayable Activity, enabling durable execution with automatic retries and time-travel debugging
Journey Context:
Naive agent loops use \`while True\` with async/await, losing all state on pod restart or transient network blips. This leads to 'zombie agents' that appear running but are stuck. By treating the agent loop as a durable workflow \(Temporal, Windmill, or Hatchet\), every external call \(LLM, tool, human approval\) is checkpointed. This enables 'suspend and resume' for human-in-the-loop, automatic retries with exponential backoff, and debugging via event history replay. The tradeoff is slight latency overhead for checkpointing, but the reliability gain is essential for production agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:46:37.746202+00:00— report_created — created