Report #38464
[frontier] LLM-based agent tests are slow, expensive, and non-deterministic
Record LLM interactions to cassettes using pytest-recording for fast, deterministic replay tests
Journey Context:
Unit testing agents that call LLMs traditionally requires mocking, which hides integration issues, or live calls, which are non-deterministic and costly. The cassette pattern \(from VCR.py, implemented in pytest-recording\) records the first live interaction to a YAML file, then replays it for subsequent runs. For agents, this means: \(1\) Record a successful multi-turn agent execution once; \(2\) Subsequent CI runs use the cassette, completing in milliseconds with zero API cost; \(3\) When the agent logic changes, delete cassettes to re-record. This transforms agent testing from integration-test-hell into fast, deterministic unit tests while maintaining coverage of real LLM responses. The tradeoff: cassettes must be version-controlled and updated when APIs change \(rare\) or agent prompts change \(frequent\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:02:17.113959+00:00— report_created — created