Report #36791
[frontier] Agent processes crash or timeout and lose all in-progress state, forcing users to restart from scratch
Make agents ephemeral and stateless by externalizing all state to a durable checkpoint store, reconstructing full context on each invocation from the checkpoint rather than maintaining long-lived in-memory sessions.
Journey Context:
Long-lived agent processes seem simpler: state stays in memory and the conversation flows naturally. In production they are a liability: memory leaks accumulate, a crash loses everything, you can't scale horizontally, and you can't inspect intermediate state for debugging. The emerging pattern (codified in LangGraph's persistence layer and Temporal-based agent workflows) treats an agent step as a serverless function: read the checkpoint from the DB, execute one reasoning step, write the updated checkpoint, terminate. If the process dies, the next invocation picks up from the last checkpoint, so at most one step of work is lost. Keeping every checkpoint rather than overwriting the latest one also enables time-travel debugging: replay from any earlier checkpoint. Tradeoff: serialization overhead and latency from reconstructing state on every step. In practice, loading and saving a checkpoint typically takes milliseconds, while the LLM inference call in the same step typically takes far longer, so the overhead is usually negligible. This is the twelve-factor statelessness principle applied to agents.
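The read / execute / write / terminate cycle described above can be sketched as follows. This is a minimal illustration, not LangGraph's or Temporal's API: it assumes SQLite as the checkpoint store and JSON-serializable state, and the names `open_store`, `load_checkpoint`, `reason_once`, and `run_step` are hypothetical. `reason_once` is a stand-in for the real LLM call.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the checkpoint store (hypothetical schema: one row per run and step)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT NOT NULL, step INTEGER NOT NULL, state TEXT NOT NULL, "
        "PRIMARY KEY (run_id, step))"
    )
    return conn


def load_checkpoint(conn, run_id, step=None):
    """Return (step, state): the latest checkpoint, or a specific one for replay."""
    if step is None:
        row = conn.execute(
            "SELECT step, state FROM checkpoints "
            "WHERE run_id = ? ORDER BY step DESC LIMIT 1",
            (run_id,),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT step, state FROM checkpoints WHERE run_id = ? AND step = ?",
            (run_id, step),
        ).fetchone()
    if row is None:
        # No checkpoint found: start from an empty initial state.
        return 0, {"messages": [], "done": False}
    return row[0], json.loads(row[1])


def reason_once(state):
    """Stand-in for one reasoning step (assumption: a real agent calls an LLM here)."""
    messages = state["messages"] + [f"thought {len(state['messages']) + 1}"]
    return {"messages": messages, "done": len(messages) >= 3}


def run_step(conn, run_id):
    """One ephemeral invocation: read checkpoint, run one step, write checkpoint."""
    step, state = load_checkpoint(conn, run_id)
    if state["done"]:
        return step, state
    new_state = reason_once(state)
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO checkpoints (run_id, step, state) VALUES (?, ?, ?)",
            (run_id, step + 1, json.dumps(new_state)),
        )
    return step + 1, new_state
```

Each call to `run_step` carries nothing over in memory, so a worker that dies mid-step leaves the last committed checkpoint untouched and the next call resumes from it. In a real deployment the store would be a file or a shared database rather than `:memory:`. Because rows are appended rather than overwritten, `load_checkpoint(conn, run_id, step=1)` returns the state as it was after the first step, which is the basis for replay. In this sketch the `(run_id, step)` primary key also makes a second worker writing the same step fail instead of silently overwriting it.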
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:13:35.757801+00:00— report_created — created