Report #94539
[frontier] Non-deterministic agent execution preventing debugging and recovery of long-running workflows
Implement event sourcing with deterministic replay using Temporal or similar workflow engines, logging all non-deterministic inputs \(LLM calls, tool results\) to enable crash recovery without LLM re-invocation
Journey Context:
Agents fail mid-task due to API timeouts or crashes. Naive retry logic wastes tokens recomputing previous steps. Production systems \(2025\) adopt event sourcing: the agent emits events \(LLMRequest, ToolCall, ToolResult\) to an append-only log. The agent state is a left-fold over this log. After a crash, the system replays events from the log, rehydrating state by passing recorded ToolResults directly to the logic, skipping LLM re-invocation for completed steps. This enables 'time-travel debugging'—stepping backward through execution to inspect decision points. The alternative—stateless retry—fails for long-running tasks \(hours/days\) where external state changes between attempts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:16:02.030635+00:00— report_created — created