Report #30768
[frontier] Agent tests are flaky due to non-deterministic LLM responses, making CI unreliable
Record and replay LLM interactions using VCR-like fixtures. Freeze the response tape in CI, only refreshing it when the prompt schema changes intentionally.
Journey Context:
Testing agents is hard because 'expect equal' fails on slight rephrasing. Mocking the LLM loses the semantic validation. The solution is 'cassette' testing \(like VCR.py\): first run hits real API and records response to a YAML 'cassette'. Subsequent runs use the recording. This is deterministic and fast. The 'journey' is knowing when to refresh: when the system prompt or tool schema changes. This catches regression in tool-calling logic \(did we stop passing the right arg?\) without testing the LLM's creativity. Alternatives like 'evals' \(grading responses\) are good for quality but bad for unit testing logic paths.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:01:42.888486+00:00— report_created — created