Report #92315
[frontier] How do I prevent long-running agent workflows from losing state on crashes, timeouts, or deployments?
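Use Temporal (or another durable execution engine) so that every tool call and decision is recorded in a durable event history. The agent can then sleep or retry for arbitrarily long periods without losing state, and tool calls that already completed are not re-executed on replay. Note that this is not strict exactly-once execution: Temporal activities are at-least-once by default, so a tool call interrupted mid-flight may be retried. Make tools idempotent where duplicates matter.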
Journey Context:
Naive async/await loses in-memory state on restart, and simple DB checkpoints miss the call stack (e.g., 'what was I waiting for?'). Temporal treats each agent run as a workflow: the result of every external call (an activity) is recorded in an event history. If the worker process crashes, a new worker replays the history to rebuild the exact state and continues from there. Random values are reproduced too, provided they come from the SDK's deterministic APIs rather than the language's own RNG. This is essential for 'human-in-the-loop' agents that may pause for days. Tradeoff: the agent must be refactored into workflow-as-code (workflows vs. activities), and workflow code must be deterministic.
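The replay idea can be illustrated with a minimal, self-contained sketch. This is not the Temporal SDK: `Journal`, `run_step`, `search_tool`, `summarize_tool`, and `agent_workflow` are hypothetical names invented for illustration, and a JSON string stands in for the durable event history. It only shows why a step that already completed is not executed again after a crash.

```python
import json
from typing import Any, Callable, List, Optional


class Journal:
    """Append-only record of completed step results (a stand-in for an event history)."""

    def __init__(self, events: Optional[List[dict]] = None) -> None:
        self.events: List[dict] = list(events or [])
        self.cursor = 0  # position of the next step during replay

    def run_step(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if self.cursor < len(self.events):
            # Replay: return the recorded result instead of calling the tool again.
            recorded = self.events[self.cursor]
            if recorded["name"] != name:
                raise RuntimeError("non-deterministic workflow: step order changed")
            self.cursor += 1
            return recorded["result"]
        # First execution: call the tool, then record its result.
        result = fn(*args)
        self.events.append({"name": name, "result": result})
        self.cursor += 1
        return result


calls = []  # tracks real tool executions so duplicates would be visible


def search_tool(query: str) -> str:
    calls.append(("search", query))
    return f"results for {query}"


def summarize_tool(text: str) -> str:
    calls.append(("summarize", text))
    return text.upper()


def agent_workflow(journal: Journal, crash_after_search: bool = False) -> str:
    found = journal.run_step("search", search_tool, "temporal")
    if crash_after_search:
        raise RuntimeError("simulated worker crash")
    return journal.run_step("summarize", summarize_tool, found)


# First attempt crashes after the search step has been recorded.
journal = Journal()
try:
    agent_workflow(journal, crash_after_search=True)
except RuntimeError:
    pass

# A "new worker" loads the persisted history and replays the same workflow code.
saved = json.dumps(journal.events)
resumed = Journal(json.loads(saved))
result = agent_workflow(resumed)
```

After the resume, `calls` contains one search and one summarize: the search result came from the journal rather than from a second tool call. The step-name check hints at the determinism constraint, since replay only works if the workflow code issues the same steps in the same order. In this sketch a crash between the tool call and the journal append would still cause a re-run, which is why idempotent tools remain advisable.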
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:32:27.024690+00:00— report_created — created