Report #46824
[frontier] Heisenbugs in agent workflows: LLM non-determinism and async race conditions make tests unreliable
Implement Deterministic Simulation Testing (DST): record every external input (LLM tokens, API responses) into a trace, then replay that trace with mocked time and fixed random seeds to reproduce the exact execution path.
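The record-then-replay step can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `TraceRecorder`, `TraceReplayer`, `run_agent`, `fake_llm`, and `fake_api` are hypothetical names invented here, and the "LLM" is a stand-in function that returns a random string to mimic non-deterministic output.

```python
import json
import random
from typing import Callable, Dict, List, Optional


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; output differs on every run.
    return f"plan-{random.randint(0, 10**9)}"


def fake_api(request: str) -> str:
    # Hypothetical stand-in for a network call.
    return "items=[1,2,3]"


class TraceRecorder:
    """Performs each external call for real and appends it to a trace."""

    def __init__(self) -> None:
        self.events: List[Dict[str, str]] = []

    def call(self, name: str, request: str, fn: Callable[[str], str]) -> str:
        response = fn(request)
        self.events.append({"name": name, "request": request, "response": response})
        return response

    def dumps(self) -> str:
        return json.dumps(self.events)


class TraceReplayer:
    """Serves recorded responses in order and fails loudly if the run diverges."""

    def __init__(self, serialized: str) -> None:
        self.events: List[Dict[str, str]] = json.loads(serialized)
        self.cursor = 0

    def call(self, name: str, request: str,
             fn: Optional[Callable[[str], str]] = None) -> str:
        # fn is accepted for interface parity but never invoked during replay.
        if self.cursor >= len(self.events):
            raise RuntimeError(f"trace exhausted at call {name!r}")
        event = self.events[self.cursor]
        if (event["name"], event["request"]) != (name, request):
            raise RuntimeError(
                f"divergence at step {self.cursor}: expected "
                f"{event['name']!r}, got {name!r}"
            )
        self.cursor += 1
        return event["response"]


def run_agent(io) -> str:
    # The agent only reaches the outside world through `io`.
    plan = io.call("llm", "plan the task", fake_llm)
    data = io.call("api", "GET /items", fake_api)
    return f"{plan} | {data}"


recorder = TraceRecorder()
live_result = run_agent(recorder)           # simulation mode: real calls, recorded
replayer = TraceReplayer(recorder.dumps())
replayed_result = run_agent(replayer)       # test mode: no real calls are made
```

The design choice worth noting is that the agent code is identical in both modes; only the object passed as `io` changes. A mismatch between the recorded and requested call raises immediately, which points at the first step where the code under test stopped following the recorded path.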
Journey Context:
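Conventional unit tests are flaky on LLM apps for two reasons: 'temperature=0' does not guarantee identical output (results can change between model versions and, in practice, between calls), and race conditions in multi-agent systems depend on timing. DST (pioneered by FoundationDB) requires architecting the agent system so that time and randomness come from explicit 'clock' and 'randomness' interfaces that can be seeded, rather than from the system clock or a global RNG. Record all non-deterministic inputs (LLM outputs, network responses) during 'simulation mode', then replay them in tests. This catches heisenbugs such as 'agent A assumed agent B would do X, but Y happened', and reproduces them exactly as long as every source of non-determinism goes through the recorded or seeded interfaces. That reproducibility is essential for debugging agent races seen in production.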
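A rough sketch of the seeded 'clock' and 'randomness' interfaces is shown here, assuming a single-threaded simulation loop that decides which agent runs next. `SimClock`, `simulate`, and the agent names are hypothetical and exist only for illustration; a real system would plug its own agent steps into the loop.

```python
import random
from typing import Dict, List, Tuple


class SimClock:
    """Virtual clock: time moves only when the simulation advances it."""

    def __init__(self) -> None:
        self.now_ms = 0

    def advance(self, delta_ms: int) -> None:
        self.now_ms += delta_ms


def simulate(seed: int, steps_per_agent: int = 3) -> List[Tuple[int, str, int]]:
    """Interleave two agents; the seed fully determines order and timing."""
    rng = random.Random(seed)   # the only source of randomness in the run
    clock = SimClock()          # the only source of time in the run
    remaining: Dict[str, int] = {
        "agent_a": steps_per_agent,
        "agent_b": steps_per_agent,
    }
    log: List[Tuple[int, str, int]] = []
    while remaining:
        # Scheduling decision comes from the seeded RNG, not from OS timing.
        name = rng.choice(sorted(remaining))
        # Simulated latency replaces real sleeps and network delays.
        clock.advance(rng.choice([10, 50, 200]))
        step = steps_per_agent - remaining[name] + 1
        log.append((clock.now_ms, name, step))
        remaining[name] -= 1
        if remaining[name] == 0:
            del remaining[name]
    return log


first = simulate(seed=7)
second = simulate(seed=7)   # same seed: same interleaving and timestamps
```

Running many seeds explores different interleavings; when one of them exposes a race, that seed alone is enough to rerun the same schedule under a debugger.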
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:04:03.887367+00:00— report_created — created