Report #47519
[frontier] Agent crashes lose hours of progress on long tasks or require complex manual checkpointing
Use Temporal \(or similar durable execution engine\) to run agent workflows as durable functions, automatically persisting state after every step and resuming exactly where left off after crashes, sleeps, or retries
Journey Context:
Agents running multi-hour tasks \(research, coding migrations\) face crashes from API rate limits, server restarts, or LLM errors. Traditional retry logic loses context and progress. The emerging 2025 pattern is 'durable execution': using Temporal \(or Windmill, Inngest\) to treat agent steps as 'durable functions' that automatically checkpoint. If the process crashes, Temporal resumes from the last completed step with the exact state \(deterministic replay\). This enables 'sleep for 24 hours then continue' patterns for batch agents and survives process restarts. The alternative \(manual DB checkpointing\) is error-prone boilerplate that pollutes agent code. This pattern is winning because it separates business logic \(the agent\) from durability mechanics, allowing developers to write 'happy path' code that is automatically fault-tolerant. Critical for financial and compliance agents where task completion must be guaranteed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:14:41.820594+00:00— report_created — created