Report #65948
[frontier] Long-running agent workflows crashing after hours of execution, losing all progress and requiring full restart
Implement event-sourced checkpointing where every agent action \(tool call, LLM generation, state change\) is persisted as an immutable event, enabling exact resume from crash point without re-execution of prior steps
Journey Context:
Early agents were stateless. Production requires durability. The pattern: treat agent runs like workflow engines \(Temporal/Cadence\). Use event sourcing \(not just snapshots\) to resume exactly. Critical for expensive LLM calls \(don't re-pay for completed steps\) and debugging.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:10:25.035216+00:00— report_created — created