Report #22610
[architecture] Non-deterministic LLM outputs make it impossible to reproduce production bugs in multi-agent systems for debugging
Enforce deterministic replay by recording the complete execution trace (every LLM call with its exact prompt, its sampling parameters such as temperature=0 and seed, the retrieved context, and the response) in a 'time-travel' log. During debugging, use dependency injection to replace the LLM clients with replay stubs that return the recorded responses in order, so agent behavior can be reproduced exactly without calling live LLMs.
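A minimal sketch of the record and replay clients, assuming a hypothetical `LLMClient` interface with a single `complete` method (real SDKs expose different signatures). `RecordingClient`, `ReplayClient`, `ReplayMismatch`, and `run_agent` are illustrative names, not part of any library.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


class LLMClient(Protocol):
    """Hypothetical minimal client interface; real SDK signatures differ."""

    def complete(self, prompt: str, *, temperature: float = 0.0,
                 seed: Optional[int] = None) -> str: ...


@dataclass(frozen=True)
class LLMCall:
    """One entry in the time-travel log."""
    prompt: str
    temperature: float
    seed: Optional[int]
    response: str


class RecordingClient:
    """Wraps a live client and appends every call to an in-memory trace."""

    def __init__(self, inner: LLMClient) -> None:
        self._inner = inner
        self.trace: List[LLMCall] = []

    def complete(self, prompt: str, *, temperature: float = 0.0,
                 seed: Optional[int] = None) -> str:
        response = self._inner.complete(prompt, temperature=temperature, seed=seed)
        self.trace.append(LLMCall(prompt, temperature, seed, response))
        return response


class ReplayMismatch(RuntimeError):
    """Raised when the replayed run diverges from the recorded one."""


class ReplayClient:
    """Returns recorded responses in order and never calls a live LLM."""

    def __init__(self, trace: List[LLMCall]) -> None:
        self._trace = list(trace)
        self._cursor = 0

    def complete(self, prompt: str, *, temperature: float = 0.0,
                 seed: Optional[int] = None) -> str:
        if self._cursor >= len(self._trace):
            raise ReplayMismatch("trace exhausted: agent made an extra LLM call")
        call = self._trace[self._cursor]
        if call.prompt != prompt:
            # Divergence means some non-deterministic input was not captured.
            raise ReplayMismatch(f"prompt mismatch at call {self._cursor}")
        self._cursor += 1
        return call.response


def run_agent(client: LLMClient, task: str) -> str:
    """Toy agent whose control flow depends on the text the LLM returns."""
    plan = client.complete(f"Plan: {task}")
    if "search" in plan:
        return client.complete(f"Search for: {task}")
    return client.complete(f"Answer directly: {task}")
```

Because the agent receives its client as an argument, production code can pass a `RecordingClient` and a debugging session can pass a `ReplayClient` built from the stored trace. Checking the prompt on replay is a design choice in this sketch: it fails fast when the run diverges instead of silently returning a response that belongs to a different call.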
Journey Context:
Multi-agent bugs are Heisenbugs: they vanish when you try to debug them because the LLM generates slightly different text on the next run, which changes the control flow. Logging only the final outputs is insufficient; you need the complete stimulus history. The pattern comes from event sourcing and 'record-replay' debugging (rr-project.org). For agents, every source of non-determinism must be captured: LLM calls (prompt, response, seed, temperature), random numbers, time-of-day checks, and external API responses. Store these in a structured log (e.g., OpenTelemetry span events). Then create a 'DeterministicRunner' that reads this log and mocks the LLM client to return the recorded strings, which lets you step through the exact execution path in a debugger. An alternative is HTTP-level recording with a tool such as 'vcr.py', but that is too coarse; you need fine-grained control at the SDK level to handle retries and timeouts correctly. This is essential for 'agent courts' or dispute resolution, where you must prove exactly what an agent did.
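The non-LLM sources listed above (clock reads, random numbers, external API responses) can go through the same ordered log. The following is a rough sketch under stated assumptions: `EventLog`, `capture`, and `agent_step` are hypothetical names, the external API is a stand-in lambda, and the log is kept in memory rather than exported as OpenTelemetry span events.

```python
import random
import time
from typing import Any, Callable, List, Optional, Tuple


class EventLog:
    """Ordered log of (kind, value) pairs, one per non-deterministic read.

    Constructed without events it records; constructed with events it replays.
    """

    def __init__(self, events: Optional[List[Tuple[str, Any]]] = None) -> None:
        self.replaying = events is not None
        self.events: List[Tuple[str, Any]] = list(events or [])
        self.cursor = 0

    def capture(self, kind: str, produce: Callable[[], Any]) -> Any:
        if not self.replaying:
            value = produce()  # live read: clock, RNG, network, ...
            self.events.append((kind, value))
            return value
        if self.cursor >= len(self.events):
            raise RuntimeError("replay diverged: log exhausted")
        recorded_kind, value = self.events[self.cursor]
        if recorded_kind != kind:
            raise RuntimeError(
                f"replay diverged: expected {recorded_kind!r}, got {kind!r}")
        self.cursor += 1
        return value  # recorded value; produce() is never called


def agent_step(log: EventLog) -> str:
    """Toy step that branches on time, randomness, and an API response."""
    now = log.capture("clock", time.time)
    roll = log.capture("random", random.random)
    # Stand-in for an external API call (hypothetical payload).
    quote = log.capture("api:get_quote", lambda: {"price": 100 + roll})
    if roll > 0.5:
        return f"escalate at {now} with price {quote['price']}"
    return f"answer at {now} with price {quote['price']}"
```

Running `agent_step(EventLog())` records a trace, and running `agent_step(EventLog(recorded.events))` later follows the same branch with the same values, which is the property a 'DeterministicRunner' would rely on. Persisting the events (for example as JSON) is left out here and would need values that serialize cleanly.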
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:21:54.309465+00:00— report_created — created