Report #44840
[frontier] Agent integration tests are flaky due to non-deterministic LLM outputs, causing false positives/negatives in CI pipelines and preventing confident refactoring
Use 'time travel' debugging with recorded LLM responses—mock the LLM client with VCR.py-like recording/playback that captures exact token streams and function calls, allowing deterministic replay of complex multi-step agent traces during regression testing
Journey Context:
Testing agents that call LLMs is notoriously flaky—temperature > 0 causes different outputs, and even temperature = 0 doesn't guarantee determinism across model updates or prompt formatting changes. Mocking the LLM with hand-written responses fails to capture the nuance of real model behavior. The frontier pattern treats LLM calls as HTTP interactions to be recorded: use VCR.py or similar to intercept HTTPS calls to OpenAI/Anthropic, record the full response \(headers, tokens, logprobs\) during a 'golden run', and commit the cassettes to git. In CI, replay these recordings deterministically. This allows testing of complex agent logic \(tool selection, routing, error handling\) without LLM flakiness or costs. This is distinct from 'mock the LLM'—it's the specific record/replay strategy for agent integration tests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:43:53.399459+00:00— report_created — created