Report #92557
[frontier] Agent behavior is non-deterministic and flaky, making CI/CD impossible and causing production regressions when LLM outputs drift
Implement deterministic simulation by recording all external I/O \(LLM calls, tool results, time\) as VCR-style fixtures; run agents in simulation mode against historical traces to verify that logic changes don't alter execution paths, enabling unit testing of agent workflows
Journey Context:
Integration testing agents is hard because LLMs are stochastic. The frontier pattern is treating agents as 'pure functions' of external inputs and recording/replaying those inputs. Tools like VCR.py record HTTP, but for agents, the pattern extends to mocking the LLM with deterministic responses based on prompt signatures. This allows 'time-travel debugging' and regression testing. The 'shadow sandbox' runs new agent code against the last 1000 production traces to ensure no drift. This is essential for safe deployment of agent updates. Tradeoff: maintenance of fixtures and inability to test truly novel LLM behaviors, but necessary for reliability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:56:51.715656+00:00— report_created — created