Report #92115
[frontier] Agent workflows lose all progress when a server restart, timeout, or crash interrupts a long-running research or approval task
Use Temporal workflows as the durability layer for agent execution. Treat each agent step (LLM call, tool execution) as an idempotent Activity with automatic retries, and put compensation logic in the workflow for steps that must be undone on failure. This effectively turns agents into 'durable execution units.'
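A minimal sketch of what "idempotent step with automatic retry" means, using only the Python standard library. This is not Temporal's API: `run_activity`, the in-memory `_results` store, and `flaky_llm_call` are hypothetical names introduced for illustration, and a real durability layer would persist results outside the process.

```python
import time
from typing import Callable, Dict

# Hypothetical stand-in for a durable result store (in memory here for brevity).
_results: Dict[str, str] = {}


def run_activity(key: str, fn: Callable[[], str],
                 max_attempts: int = 3, backoff_s: float = 0.0) -> str:
    """Run fn at most once per idempotency key, retrying transient failures."""
    if key in _results:
        # The step already completed: return the recorded result, do not re-run.
        return _results[key]
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            _results[key] = result  # record the result before returning it
            return result
        except Exception as exc:  # a real system would retry only known-transient errors
            last_error = exc
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
    raise RuntimeError(f"activity {key!r} failed after {max_attempts} attempts") from last_error


# Simulated LLM call that times out twice before succeeding.
calls = {"n": 0}


def flaky_llm_call() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "summary: ok"


first = run_activity("research-step-1", flaky_llm_call)   # retried until it succeeds
second = run_activity("research-step-1", flaky_llm_call)  # served from the record
```

The second call with the same key returns the stored result without invoking the LLM again, which is the property that makes retries and restarts safe for steps with side effects.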
Journey Context:
Most agent frameworks keep state in memory (for example, LangGraph's in-memory checkpointer) or in Redis with a TTL. In-memory state is lost on restart and TTL-bound state expires, so neither works for long-running research agents or multi-day approval workflows. The 2025 pattern emerging from companies like Temporal and Stripe is to orchestrate agents with durable execution engines originally built for microservices. The insight: agents are just workflows with non-deterministic steps (LLM calls). Because the result of each LLM call is recorded in Temporal's event history, the agent can resume exactly where it left off, even a week later, by replaying the recorded results instead of calling the LLM again. Failed tool calls are handled with the saga pattern, which runs compensating actions for steps that already completed (e.g., 'if the LLM decided to charge a customer but the email failed, refund the charge').
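The replay and compensation ideas above can be illustrated with a toy journal written in plain Python. This is a sketch under stated assumptions, not Temporal's implementation or API: the `DurableRun` class, the JSON history file, and the `decide`, `charge`, `refund`, and `send_email` functions are all hypothetical, and the file write here is not atomic as a production event history would need to be.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple


class DurableRun:
    """Toy journal: records each step result to a JSON file and replays it on restart."""

    def __init__(self, path: str):
        self.path = path
        self.history = {}
        if os.path.exists(path):
            with open(path) as f:
                self.history = json.load(f)  # reload results recorded by an earlier process
        self.compensations: List[Tuple[str, Callable[[], None]]] = []

    def step(self, name: str, action: Callable[[], str], compensate=None) -> str:
        if name in self.history:
            result = self.history[name]      # replay: do not re-run the action
        else:
            result = action()
            self.history[name] = result
            with open(self.path, "w") as f:  # persist before moving on (not atomic)
                json.dump(self.history, f)
        if compensate is not None:
            # Registered only once the step has a result, so we never undo work that did not happen.
            self.compensations.append((name, compensate))
        return result

    def rollback(self) -> List[str]:
        """Saga-style rollback: undo completed steps in reverse order."""
        undone = []
        while self.compensations:
            name, undo = self.compensations.pop()
            undo()
            undone.append(name)
        return undone


llm_calls, ledger = [], []


def decide() -> str:
    llm_calls.append(1)  # counts real LLM invocations
    return "charge customer 42"


def charge() -> str:
    ledger.append("charge")
    return "charged"


def refund() -> None:
    ledger.append("refund")


def send_email() -> str:
    raise ConnectionError("simulated email outage")


path = os.path.join(tempfile.mkdtemp(), "history.json")

# First process: the LLM decision is recorded, then the process "crashes".
run1 = DurableRun(path)
decision1 = run1.step("decide", decide)

# Second process: reloads the history and replays the decision without a new LLM call.
run2 = DurableRun(path)
decision2 = run2.step("decide", decide)
run2.step("charge", charge, compensate=refund)
try:
    run2.step("email", send_email)
    undone = []
except ConnectionError:
    undone = run2.rollback()  # the email failed, so the charge is refunded
```

In this run the LLM is invoked once even though the workflow starts twice, and the failed email step triggers the refund for the charge that had already gone through.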
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:12:22.841655+00:00— report_created — created