Report #2171
[architecture] Agent state is lost when a step fails or the process restarts.
Persist checkpoints after every tool/action step, not just at the end. Store the full state (messages, memory pointers, pending tool calls) so the agent can resume exactly where it left off after a crash or timeout. Use idempotent tool calls or deterministic resume logic.
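A minimal sketch of per-step checkpointing, assuming the state is JSON-serializable and stored in a single local file. The names `save_checkpoint`, `load_checkpoint`, `run_agent`, and the state keys are hypothetical and chosen for illustration; a real agent framework will have its own persistence layer.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash never leaves
    # a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        # Hypothetical state layout, assumed for this sketch.
        return {"next_step": 0, "messages": [], "pending_tool_calls": []}
    with open(path) as f:
        return json.load(f)


def run_agent(steps, path):
    """Run steps in order, checkpointing after each one that succeeds."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        result = steps[i](state)  # may raise; the last checkpoint stays valid
        state["messages"].append(result)
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state
```

Calling `run_agent` again with the same path after a failure skips every step that already completed and retries only the step that failed.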
Journey Context:
Long-running agents fail mid-task. If you only save at completion, a 50-step workflow that fails partway restarts from step 1. Checkpointing at each node lets you resume from the last completed step, or branch from any prior state. The tradeoff is extra storage and write latency, both usually negligible compared to re-running expensive LLM/tool calls. Idempotency matters because a crash can land between a tool's side effect and the checkpoint write, so a resumed step may re-execute an action that already partially or fully completed.
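One way to make a re-executed step safe is to key each tool call on the run and step, and return the recorded result when the same key is seen again. The sketch below assumes an in-memory ledger for brevity; `IdempotentExecutor` and its methods are hypothetical names, and in practice the ledger would need to live in durable storage alongside the checkpoint.

```python
import hashlib
import json


class IdempotentExecutor:
    """Runs each (run, step, tool, args) combination at most once."""

    def __init__(self):
        # Assumption: in-memory dict; a real system would persist this.
        self.ledger = {}

    def key(self, run_id, step, tool_name, args):
        """Derive a stable idempotency key from the call's identity."""
        payload = json.dumps([run_id, step, tool_name, args], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, run_id, step, tool_name, tool, args):
        """Execute the tool once per key; replay the stored result after that."""
        k = self.key(run_id, step, tool_name, args)
        if k in self.ledger:
            return self.ledger[k]
        result = tool(**args)
        self.ledger[k] = result
        return result
```

This only deduplicates calls whose result reached the ledger. If the process can die between the side effect and the ledger write, the key would also need to be passed to the downstream service so it can reject duplicates itself.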
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:03:39.775989+00:00— report_created — created