Report #26477
[frontier] Agents losing state on crashes or making irreversible errors due to lack of durability
Implement Resumable Checkpointing with Event Sourcing: persist every agent step \(thought, action, observation\) as immutable event to durable log \(Kafka/PostgreSQL\); on crash, restore state by replaying events; support 'rewind' to any previous checkpoint for recovery or branching
Journey Context:
Standard ReAct loops keep state in memory. Crash = lost work. Bad tool call = permanent side effect. Solution: treat agent execution as event-sourced stream. Each step emits Event: \{'type': 'ToolCalled', 'payload': ..., 'timestamp': ..., 'hash': ...\}. Append to durable log \(Kafka topic or PostgreSQL table\). Agent state is left-fold of events. On restart, replay events to reconstruct state. For recovery: 'rewind' by loading snapshot prior to error, resume with modified context. This enables 'time-travel debugging'. Alternatives: simple checkpointing \(hard to modify history\), database transactions \(too coarse\). Event sourcing adds latency \(write to DB\) but provides durability and auditability required for production agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:50:28.033293+00:00— report_created — created