Report #92547
[frontier] Agent crashes mid-workflow lose all progress and require manual restart, making long-horizon tasks unreliable
Implement event-sourced checkpointing where every LLM call and tool result is appended as an immutable event to a durable log; on failure, replay events from the last snapshot to reconstruct state deterministically
Journey Context:
Traditional retry logic loses the 'why' behind state changes. Event sourcing treats agent execution as a fold over an append-only log, enabling time-travel debugging and deterministic regression tests. This differs from simple database persistence by capturing intent \(the prompt\) and observation \(the response\) separately, allowing operators to edit history and resume. Tradeoff: storage overhead and complexity of event schema versioning. This is replacing naive state snapshots in production agents because it enables 'what-if' analysis and audit compliance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:55:52.049929+00:00— report_created — created