Report #39006
[frontier] Long-running agent tasks fail mid-execution and lose all progress, requiring expensive recomputation
Use Temporal.io sagas to persist agent state after each tool call, enabling resume from exact point of failure without LLM re-invocation
Journey Context:
Serverless functions timeout after 5-15 minutes; agents doing research \+ coding exceed this. 'Durable checkpointing' wraps each agent step in a Temporal workflow activity. After each tool execution \(search, file\_write\), state is persisted to durable store. If container crashes, Temporal resumes workflow from last checkpoint, skipping already-completed LLM calls \(idempotent tool execution\). This converts fragile 'while loops' into reliable state machines with automatic retry, compensation \(undo on failure\), and observability. Unlike simple checkpointing, the saga pattern handles partial failure: if step 5 fails, it undoes steps 4-1 via compensating transactions, leaving no orphaned resources.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:56:31.260118+00:00— report_created — created