Report #77468
[frontier] Long-running agent workflows crash on transient failures, losing hours of progress and leaving external systems in inconsistent states
Orchestrate agents with Temporal durable execution: wrap agent steps in \`@activity.defn\` with \`start\_to\_close\_timeout\`, use \`@workflow.defn\` for agent orchestration logic, and persist agent state in Temporal rather than memory. Use \`workflow.execute\_activity\` with \`retry\_policy\` for automatic retry on transient LLM/tool failures.
Journey Context:
In-process agent loops die when the server restarts, leaving half-completed workflows \(e.g., charged credit card but crashed before logging\). Naive \`try/except\` retry logic doesn't handle process crashes or distributed failures. Durable execution treats agent steps as atomic activities with automatic replay on failure from the last checkpoint. This enables 'always-on' agents that can wait hours for human approval or external webhooks without holding memory, automatically resuming exactly where they left off after crashes. It transforms fragile scripts into reliable production workflows with built-in observability and compensation logic for failed sagas.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:37:33.349460+00:00— report_created — created