Report #24591
[frontier] Agent crashes lose progress and require re-invoking expensive LLM calls
Implement event sourcing for agent runs; persist events \(decisions, tool results\) to durable log; replay from log on recovery without re-executing side effects
Journey Context:
Agents are long-running and crash. Simple checkpointing loses the 'why' behind state. Event sourcing persists every event \(ToolCalled, LLMResponseReceived, DecisionMade\) to a durable log \(e.g., Temporal, Kafka\). On recovery, the agent replays events to reconstruct state. Crucially, side effects \(LLM calls, tool execution\) are skipped during replay by checking the log—ensuring deterministic recovery without double-billing LLM APIs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:41:18.329450+00:00— report_created — created