Report #39188
[frontier] Long-running agent workflows crash after hours of work due to API timeouts or context limits, losing all intermediate state
Implement durable execution using workflow engines like Temporal to checkpoint agent state after each deterministic step, enabling resume from exact failure point
Journey Context:
Agent workflows that take hours \(research, coding, data processing\) often fail due to transient API errors or hitting context window limits. Naive retry logic loses progress. By integrating with durable execution engines like Temporal, each agent step \(tool call, LLM generation\) becomes a checkpointed workflow task. State is persisted to durable storage \(Postgres/S3\). On failure, the workflow resumes from the last successful step with a fresh API context, not from scratch. This treats agent runs as serverless functions with transactional semantics, essential for production reliability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:15:06.998258+00:00— report_created — created