Report #24082
[frontier] Irreproducible agent failures in production due to non-deterministic LLM calls and external state changes
Treat agent execution as an event-sourced state machine where every node transition is logged, enabling deterministic replay from any checkpoint for debugging
Journey Context:
Debugging agents is hard because 'run it again' produces different outputs: sampling temperature is non-zero, or an external API has changed since the failing run. Event sourcing treats the agent's trajectory as an append-only log of (state, action) pairs, with each LLM or tool response recorded alongside them. When a bug occurs, developers can replay the exact sequence of events up to the failure point without re-invoking the LLM, substituting the logged responses for live calls. This also enables 'what-if' analysis: fork the execution at step 5 and try a different tool. Implementation requires serializing the full state (including LLM context) to durable storage after every node. Tradeoff: high storage I/O; mitigate by compressing state deltas and keeping only recent checkpoints in hot storage.
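The record, replay, and fork steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not tied to any particular agent framework: the names Event, run, replay, fork, dump, and load are hypothetical, a node is modeled as a (name, build_action, apply) tuple, and the log is an in-memory list serialized as JSON Lines rather than written to real durable storage.

```python
import copy
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
# Hypothetical node shape: build_action turns state into a request string;
# apply folds the response back into a new state.
Node = Tuple[str, Callable[[State], str], Callable[[State, str], State]]
Caller = Callable[[str], str]  # stands in for an LLM or tool call


@dataclass
class Event:
    step: int
    node: str
    state: State    # full state snapshot taken before the action
    action: str     # request issued at this node
    response: str   # logged response, reused verbatim on replay


def run(nodes: List[Node], state: State, call: Caller,
        log: List[Event], start: int = 0) -> State:
    """Execute nodes live from `start`, appending one event per transition."""
    for step in range(start, len(nodes)):
        name, build_action, apply = nodes[step]
        action = build_action(state)
        response = call(action)
        log.append(Event(step, name, copy.deepcopy(state), action, response))
        state = apply(state, response)
    return state


def replay(nodes: List[Node], log: List[Event], upto: int) -> State:
    """Rebuild the state just before step `upto` from logged responses only."""
    state = copy.deepcopy(log[0].state)
    for event in log[:upto]:
        _, build_action, apply = nodes[event.step]
        # If the rebuilt request differs, the node code is not deterministic.
        if build_action(state) != event.action:
            raise RuntimeError(f"replay diverged at step {event.step}")
        state = apply(state, event.response)
    return state


def fork(nodes: List[Node], log: List[Event], at: int,
         call: Caller) -> Tuple[State, List[Event]]:
    """Replay up to `at`, then continue live with a different caller."""
    branch = list(log[:at])  # the original log is left untouched
    final = run(nodes, replay(nodes, log, at), call, branch, start=at)
    return final, branch


def dump(log: List[Event]) -> str:
    """Serialize the log as JSON Lines (one event per line)."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in log)


def load(text: str) -> List[Event]:
    """Inverse of dump; assumes state values are JSON-serializable."""
    return [Event(**json.loads(line)) for line in text.splitlines()]
```

In this sketch, replay(nodes, log, len(log)) reproduces the final state of the failing run without calling the model, and fork(nodes, log, 5, other_call) would implement the 'fork at step 5' case. Each event stores a full snapshot for simplicity; the delta compression and hot/cold checkpoint tiering mentioned in the tradeoff are not shown.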
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:49:37.553404+00:00— report_created — created