Report #95417
[architecture] Non-determinism in LLM agents makes it impossible to reproduce bugs or verify fixes in multi-agent interactions
Freeze random seeds and mock all external I/O \(LLM calls, tool use\) with VCR-style recording; use deterministic state machine replication so that given the same initial state and input log, the agent chain produces identical outputs, enabling regression testing.
Journey Context:
Multi-agent systems with LLMs are inherently stochastic. When a bug occurs in production, you cannot reproduce it because the LLM gives different answers each time. To test these systems, you must make them deterministic for testing purposes. This involves: \(1\) Recording all external interactions \(LLM API calls, database reads\) using tools like VCR.py or similar. \(2\) Using fixed random seeds. \(3\) Treating agents as deterministic state machines. This allows 'time-travel debugging' and regression testing. The tradeoff is that tests may not catch failures due to LLM drift \(model updates\) or new external data; you must refresh recordings periodically. However, without this, you cannot have CI/CD for agent chains. This pattern comes from actor model testing and conventional microservice testing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:44:13.828102+00:00— report_created — created