Report #98064
[frontier] Production agents crash mid-workflow and lose progress, requiring expensive re-execution of LLM calls
Run agent loops as durable workflows. Use Temporal, DBOS, Restate, or LangGraph checkpointing; wrap each LLM call and each external side effect as a durable step, so that replay returns the recorded result rather than re-invoking the model. For very long histories, use Temporal's continueAsNew or an equivalent compaction mechanism.
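To make the "replay returns the recorded result" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the API of Temporal, DBOS, Restate, or LangGraph; `Journal`, `durable_step`, `fake_llm`, and `run_agent` are hypothetical names invented for illustration, and an in-memory SQLite table stands in for a real durable store.

```python
import json
import sqlite3
from typing import Any, Callable


class Journal:
    """Hypothetical journal of completed steps, keyed by (workflow_id, step)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "workflow_id TEXT, step TEXT, result TEXT, "
            "PRIMARY KEY (workflow_id, step))"
        )

    def durable_step(self, workflow_id: str, step: str, fn: Callable[[], Any]) -> Any:
        row = self.db.execute(
            "SELECT result FROM steps WHERE workflow_id = ? AND step = ?",
            (workflow_id, step),
        ).fetchone()
        if row is not None:
            # Replay path: return the recorded result without calling fn again.
            return json.loads(row[0])
        result = fn()  # first execution: an LLM call or an external side effect
        self.db.execute(
            "INSERT INTO steps VALUES (?, ?, ?)",
            (workflow_id, step, json.dumps(result)),
        )
        self.db.commit()
        return result


calls = {"llm": 0}


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; counts how often it is invoked."""
    calls["llm"] += 1
    return f"plan for: {prompt}"


def run_agent(journal: Journal, workflow_id: str, crash_after_plan: bool = False) -> str:
    plan = journal.durable_step(workflow_id, "plan", lambda: fake_llm("book a flight"))
    if crash_after_plan:
        raise RuntimeError("simulated crash")
    return journal.durable_step(workflow_id, "summarize", lambda: fake_llm(plan))
```

Running `run_agent(journal, "wf-1", crash_after_plan=True)` and then `run_agent(journal, "wf-1")` against the same journal invokes `fake_llm` once per step in total, because the second run reads the "plan" result from the journal. In this sketch a crash between `fn()` and the commit would still re-run that one step, so side effects inside a step should be idempotent.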
Journey Context:
Agent demos restart from scratch on failure; production agents need the same guarantees as distributed workflows. Durable execution journals each completed step and, after a crash, replays the workflow by returning the journaled results instead of re-running finished steps. Because LLM outputs are non-deterministic, the critical rule is that LLM calls must run inside recorded steps (Activities, in Temporal's terms), not inline in workflow code; an inline call could return a different output on replay and diverge from the journaled history. Temporal's continueAsNew handles histories that grow over days; DBOS gives durability through Postgres, with no infrastructure beyond the database; LangGraph's checkpoint savers fit graph-shaped agents. Choose the smallest backend that matches your topology, but never let an unjournaled LLM call sit in the middle of a long-running workflow.
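The history-growth point can be illustrated with a small sketch of compaction: when a run's history reaches a threshold, the loop closes that run and starts a fresh one that carries forward only a compact state. This is a plain-Python illustration of the idea, not Temporal's continueAsNew API; `Run`, `compact`, `agent_loop`, and `MAX_HISTORY` are hypothetical names, and the threshold of 3 is arbitrary.

```python
from dataclasses import dataclass, field
from typing import List

MAX_HISTORY = 3  # hypothetical threshold; real limits depend on the backend


@dataclass
class Run:
    """One workflow run: state carried over from earlier runs plus its own history."""
    carried_state: str
    history: List[str] = field(default_factory=list)


def compact(run: Run) -> str:
    # Stand-in for summarization. In a real agent this could itself be an
    # LLM call, in which case it should run as a journaled step.
    return f"{run.carried_state}+{len(run.history)} events"


def agent_loop(events: List[str]) -> List[Run]:
    runs = [Run(carried_state="start")]
    for event in events:
        current = runs[-1]
        if len(current.history) >= MAX_HISTORY:
            # Close the full run and continue in a new one that starts with
            # an empty history and only the compacted state.
            current = Run(carried_state=compact(current))
            runs.append(current)
        current.history.append(event)
    return runs
```

With seven events and a threshold of 3, `agent_loop` produces three runs, none of which holds more than three history entries. Replay after a crash then only needs to process the latest run's history.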
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:10:24.664065+00:00— report_created — created