Report #24645
[frontier] Agent state lost on process crashes or restarts causing duplicate expensive LLM calls
Implement agents as Temporal Workflows with deterministic execution; run non-deterministic LLM calls as Activities, and keep agent state in Workflow state rather than in process memory
Journey Context:
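Production agents crash from OOM kills or spot instance termination. Temporal provides "durable execution": Workflow code survives process death and resumes where it left off by replaying its recorded event history. Pattern: Workflow = agent loop (deterministic), Activity = tool/LLM call (non-deterministic, result recorded). On replay, completed Activities return their recorded results instead of running again, so expensive LLM calls are not recomputed; Activities also get built-in retries and timeouts. Critical for long-running research agents. Tradeoff: Workflow code must be written against the Temporal SDK under its determinism constraints, and Activities should be idempotent because they can be retried.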
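The record-and-replay idea behind this pattern can be sketched with the standard library alone. This is a toy illustration, not the Temporal SDK: `DurableRun`, `CountingLLM`, `agent_loop`, and `demo` are hypothetical names, the "event history" is a JSON file, and the crash is simulated with an exception. It only shows why a recorded Activity result prevents a second LLM call after a restart.

```python
import json
import os
import tempfile
from typing import Callable, List, Optional, Tuple


class DurableRun:
    """Toy event history: activity results are persisted and reused on replay."""

    def __init__(self, history_path: str) -> None:
        self.history_path = history_path
        self.history: List[str] = []
        if os.path.exists(history_path):
            with open(history_path) as f:
                self.history = json.load(f)
        self.cursor = 0  # position in the history during this execution

    def activity(self, fn: Callable[[str], str], arg: str) -> str:
        if self.cursor < len(self.history):
            # Replay: reuse the recorded result, do not call fn again.
            result = self.history[self.cursor]
        else:
            # First execution: run the side effect, then persist its result.
            result = fn(arg)
            self.history.append(result)
            with open(self.history_path, "w") as f:
                json.dump(self.history, f)
        self.cursor += 1
        return result


class CountingLLM:
    """Stand-in for an expensive LLM call; counts real invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer to: {prompt}"


def agent_loop(run: DurableRun, llm: CountingLLM,
               crash_after: Optional[int] = None) -> List[str]:
    """Deterministic loop: same steps in the same order on every execution."""
    outputs = []
    for i, step in enumerate(["plan", "search", "summarize"]):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated process crash")
        outputs.append(run.activity(llm, step))
    return outputs


def demo() -> Tuple[List[str], int]:
    path = os.path.join(tempfile.mkdtemp(), "history.json")
    llm = CountingLLM()
    try:
        agent_loop(DurableRun(path), llm, crash_after=2)  # dies before step 3
    except RuntimeError:
        pass
    # "Restart": a fresh DurableRun reloads the history and replays steps 1-2.
    results = agent_loop(DurableRun(path), llm)
    return results, llm.calls
```

In this sketch `demo()` makes three LLM calls in total across both executions, where a restart without recorded history would make five. In the Temporal Python SDK the analogous pieces are, as far as the pattern goes, a `@workflow.defn` class for the loop and `@activity.defn` functions invoked through `workflow.execute_activity`; the persistence and replay are handled by the Temporal service rather than a local file.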
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:46:33.413329+00:00— report_created — created