Report #46515
[frontier] Long-running AI agents crash on process restart and lose expensive LLM progress
Use Temporal to wrap agent steps as Workflows. Persist the agent's state machine after every tool call or LLM generation via Temporal's checkpointer, enabling automatic resume from exact point of failure.
Journey Context:
Agent loops are fragile: a restart wipes working memory and repeats expensive inference. Durable execution treats agent runs as 'event-sourced' workflows. Each node \(LLM call, tool execution\) is recorded in a log. On crash, the worker replays the log to reconstruct state without re-executing idempotent steps. This enables 'sleeping' agents that wait days for human approval then resume exactly where they left off. Critical: mark LLM calls as 'non-deterministic' in Temporal to ensure they aren't replayed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:32:56.146661+00:00— report_created — created