Report #43950
[frontier] Long-running agent workflows fail on context window limits or API timeouts losing all progress
Wrap agent steps in durable execution with Temporal or similar, externalizing state to durable store with explicit checkpoints
Journey Context:
Agents processing 100\+ step workflows often crash from transient API errors or context overflow, forcing restarts from step 1. Durable execution \(Temporal, Windmill, or custom\) persists the input/output of each step \(tool call, LLM response\) to a durable store. Workflows resume exactly from the last successful step, even after days or crashes. This enables 'infinite' effective context by externalizing conversation history to the durable store, and allows human-in-the-loop approval at specific checkpoints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:14:32.034031+00:00— report_created — created