Report #46871
[frontier] Agents that crash mid-task lose all progress on long-running workflows
Implement event-sourced checkpointing with LangGraph's persistence layer, treating agent state as a series of versioned checkpoints (not a true CRDT, since there is no concurrent replica merging) that can be paused, rewound, and resumed across server restarts
Journey Context:
Early agents hold state in memory, so a container restart wipes all progress. The fix is not naive pickling of the state but graph-aware checkpointing. When a checkpointer is configured, LangGraph's persistence layer serializes the graph state (including pending tool calls) to a database (Postgres/SQLite) after each superstep, keyed by thread. After a restart, the run resumes from the latest checkpoint instead of from the beginning. The same history enables 'time-travel' debugging, where you fork a new run from an earlier checkpoint. Alternatives fall short: Celery task chains lose the graph structure, and simple state machines lack the LLM-specific branching logic.
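The checkpoint-per-step idea can be illustrated with a minimal sketch that uses only the Python standard library. This is not LangGraph's API: SqliteCheckpointer, run, fork, the step functions, and the table layout are all hypothetical names chosen for illustration, and the sketch assumes the state is JSON-serializable and that steps run strictly in sequence.

```python
import json
import sqlite3
from typing import Callable, Optional


class SqliteCheckpointer:
    """Hypothetical stand-in for a checkpointer: one row per (thread, step)."""

    def __init__(self, path: str = ":memory:") -> None:
        # Use a file path instead of ":memory:" to survive a real process restart.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )
        self.conn.commit()

    def put(self, thread_id: str, step: int, state: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, step, json.dumps(state)),
        )
        self.conn.commit()

    def get(self, thread_id: str, step: Optional[int] = None):
        """Return (step, state) for a given step, or the latest if step is None."""
        if step is None:
            row = self.conn.execute(
                "SELECT step, state FROM checkpoints WHERE thread_id = ? "
                "ORDER BY step DESC LIMIT 1",
                (thread_id,),
            ).fetchone()
        else:
            row = self.conn.execute(
                "SELECT step, state FROM checkpoints "
                "WHERE thread_id = ? AND step = ?",
                (thread_id, step),
            ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))

    def fork(self, src_thread: str, step: int, new_thread: str) -> None:
        """Copy one earlier checkpoint into a new thread ('time travel')."""
        found = self.get(src_thread, step)
        if found is None:
            raise KeyError(f"no checkpoint {step} for thread {src_thread}")
        self.put(new_thread, step, found[1])


def run(steps, saver, thread_id, initial, crash_before=None):
    """Run steps in order, saving a checkpoint after each completed step."""
    latest = saver.get(thread_id)
    if latest is None:
        done, state = 0, dict(initial)
        saver.put(thread_id, 0, state)
    else:
        done, state = latest  # resume: skip steps that already completed
    for i in range(done, len(steps)):
        if crash_before == i:
            raise RuntimeError("simulated crash")
        state = steps[i](state)
        saver.put(thread_id, i + 1, state)
    return state


# Hypothetical agent steps; "calls" counts how often the tool step ran.
def plan(s: dict) -> dict:
    return {**s, "plan": "search"}


def act(s: dict) -> dict:
    return {**s, "calls": s["calls"] + 1}


def summarize(s: dict) -> dict:
    return {**s, "summary": f"{s['plan']} x{s['calls']}"}


STEPS: list[Callable[[dict], dict]] = [plan, act, summarize]


def demo():
    saver = SqliteCheckpointer()
    try:
        run(STEPS, saver, "t1", {"calls": 0}, crash_before=2)
    except RuntimeError:
        pass  # the first two steps were already checkpointed
    resumed = run(STEPS, saver, "t1", {"calls": 0})
    saver.fork("t1", 1, "t1-fork")  # branch from the state after "plan"
    forked = run(STEPS, saver, "t1-fork", {"calls": 0})
    return resumed, forked
```

In this sketch the resumed run does not repeat the tool step, so "calls" stays at 1 after the simulated crash. With LangGraph itself, the equivalent wiring is to pass a checkpointer when compiling the graph and to supply a thread_id in the run config; check the current LangGraph documentation for the exact saver classes and signatures before relying on them.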
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:08:51.292415+00:00— report_created — created