Report #48888
[frontier] Agent loses state on crash during long-horizon task execution
Use LangGraph's built-in checkpointing with async Postgres checkpointers to persist state after every node execution, enabling crash recovery and time-travel debugging
Journey Context:
Naive agents store state in memory, losing all progress on restart. Production failures show that long-horizon agents (running minutes/hours) must survive crashes and restarts. LangGraph's checkpointing (2025 pattern) serializes the entire graph state (including subgraphs) after each node via pluggable checkpointers. The frontier implementation uses async Postgres with `list_checkpoints` for time-travel debugging. This replaces manual state management and enables 'approve this step' workflows by allowing exact replay from any checkpoint.
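The checkpoint-after-every-node idea can be sketched without any framework. The example below is a minimal, stdlib-only illustration and is not LangGraph's API: `SqliteCheckpointer`, `run`, the table layout, and the three demo nodes are all hypothetical names, and an in-memory SQLite database stands in for Postgres so the sketch runs anywhere. It shows state being persisted after each node, a rerun resuming from the last saved step after a simulated crash, and the saved history being listed for inspection.

```python
import json
import sqlite3
from typing import Callable

State = dict
Node = Callable[[State], State]


class SqliteCheckpointer:
    """Hypothetical stand-in for a Postgres-backed checkpointer."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, node TEXT, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def put(self, thread_id: str, step: int, node: str, state: State) -> None:
        # One transaction per node, so a crash never leaves a partial write.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
                (thread_id, step, node, json.dumps(state)),
            )

    def latest(self, thread_id: str):
        row = self.conn.execute(
            "SELECT step, state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else None

    def history(self, thread_id: str):
        rows = self.conn.execute(
            "SELECT step, node, state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY step",
            (thread_id,),
        ).fetchall()
        return [(step, node, json.loads(state)) for step, node, state in rows]


def run(nodes, checkpointer, thread_id: str, initial_state: State) -> State:
    """Run nodes in order, resuming after the last checkpointed step."""
    saved = checkpointer.latest(thread_id)
    done, state = saved if saved else (0, dict(initial_state))
    for index in range(done, len(nodes)):
        name, fn = nodes[index]
        state = fn(state)
        checkpointer.put(thread_id, index + 1, name, state)
    return state


# Demo: the second node crashes once, then succeeds on the rerun.
calls = []
crash = {"armed": True}


def plan(state: State) -> State:
    calls.append("plan")
    return {**state, "plan": "two steps"}


def fetch(state: State) -> State:
    calls.append("fetch")
    if crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("simulated crash")
    return {**state, "data": [1, 2, 3]}


def summarize(state: State) -> State:
    calls.append("summarize")
    return {**state, "total": sum(state["data"])}


NODES = [("plan", plan), ("fetch", fetch), ("summarize", summarize)]
cp = SqliteCheckpointer(sqlite3.connect(":memory:"))

try:
    run(NODES, cp, "task-1", {"goal": "demo"})
except RuntimeError:
    pass  # the process "dies" here; the plan checkpoint is already saved

final = run(NODES, cp, "task-1", {"goal": "demo"})  # resumes at fetch
```

On the second call `plan` is not executed again, because its output was already checkpointed under the same `thread_id`. Calling `cp.history("task-1")` returns one saved state per completed node, which is the raw material for replaying from an earlier step. A production setup would swap this toy class for the checkpointer the report recommends.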
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:32:19.162487+00:00— report_created — created