Report #76956
[frontier] Agent workflows fail catastrophically on crashes, losing hours of progress and leaving external systems in inconsistent states
Orchestrate agent steps with Temporal (or a similar durable execution engine): the agent loop runs as a workflow, and each tool call and LLM generation runs as an activity whose result is checkpointed automatically. After a failure the workflow is replayed from its recorded history, and saga-pattern compensation undoes external side effects.
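The checkpoint-and-replay idea can be illustrated without any engine at all. The following is a minimal sketch in plain Python, not Temporal's API: `Journal`, `run_step`, `research_workflow`, and the two step functions are hypothetical names made up for this illustration, and a JSON file stands in for the server-side history a real engine would keep.

```python
import json
import os
import tempfile


class Journal:
    """Hypothetical stand-in for an engine's persisted step history."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def run_step(self, step_id, fn, *args):
        # On replay, return the recorded result instead of re-running the step.
        if step_id in self.results:
            return self.results[step_id]
        result = fn(*args)
        self.results[step_id] = result
        # Write to a temp file, then rename, so a crash cannot leave a torn file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        return result


calls = []               # records which steps actually executed
crash = {"armed": True}  # makes the second step fail exactly once


def search(query):
    calls.append("search")
    return f"results for {query}"


def summarize(text):
    calls.append("summarize")
    if crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("simulated worker crash")
    return text.upper()


def research_workflow(journal, query):
    found = journal.run_step("1-search", search, query)
    return journal.run_step("2-summarize", summarize, found)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "journal.json")
    try:
        research_workflow(Journal(path), "durable agents")
    except RuntimeError as exc:
        first_error = str(exc)
    # A fresh Journal stands in for a new worker picking up the same task.
    final = research_workflow(Journal(path), "durable agents")

print(final)
print(calls)
```

The second run skips `search` because its result is already in the journal, and only re-executes the step that had not completed. A step that crashes after its side effect but before its result is recorded would run again, which is why side-effecting steps should be idempotent.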
Journey Context:
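LangChain's built-in chains lose their in-memory state if the process crashes. Celery retries operate on whole tasks, which is too coarse for multi-step reasoning. Temporal treats workflows as code that persists: the result of every awaited activity or timer is recorded in the workflow's event history on the Temporal server. If the worker dies, another worker replays that history and resumes where the first left off, without re-running completed steps. For agents, this means a 20-step research task can survive a Kubernetes pod restart at step 19. Compensation workflows handle partial failures (e.g., refunding a booked flight if the hotel booking fails). This makes agents suitable for production business processes, with one caveat: workflow logic runs effectively once, but activities are retried at least once by default, so side-effecting tool calls need to be idempotent (for example, via idempotency keys) to get exactly-once behavior end to end.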
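The flight-and-hotel compensation case can be sketched as a small saga runner. This is an illustrative sketch in plain Python rather than any engine's API: `run_saga`, `SagaFailed`, and the booking functions are hypothetical names. In a durable execution engine each action and compensation would typically be its own checkpointed step.

```python
class SagaFailed(Exception):
    """Raised after all completed steps have been compensated."""


def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    succeeded, in reverse order, then raise SagaFailed.
    """
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception as exc:
        for compensate in reversed(completed):
            compensate()
        raise SagaFailed(str(exc)) from exc


log = []


def book_flight():
    log.append("flight booked")


def refund_flight():
    log.append("flight refunded")


def book_hotel():
    raise RuntimeError("hotel sold out")


def cancel_hotel():
    log.append("hotel cancelled")


try:
    run_saga([(book_flight, refund_flight), (book_hotel, cancel_hotel)])
    outcome = "committed"
except SagaFailed as exc:
    outcome = f"rolled back: {exc}"

print(outcome)
print(log)
```

Only the flight is refunded, because the hotel booking never succeeded and so its compensation is never registered. Compensations themselves can fail, so in practice they should be retried and written to be idempotent.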
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:46:09.178276+00:00 - report_created - created