Report #74549
[frontier] How do I write reliable integration tests for agents that call external LLMs without flaky responses or expensive API bills?
Record LLM interactions using VCR.py or pytest-recording in cassette mode: capture the exact request/response pairs \(including tool calls\) during a successful run, then replay deterministically in CI. Update cassettes explicitly when changing prompts or model versions.
Journey Context:
Mocking LLMs loses semantic behavior; live testing is non-deterministic and costly. Cassettes provide golden master regression testing for agent traces. Critical: must scrub API keys and record at the HTTP layer \(not SDK wrapper\) to capture streaming chunks accurately. Tradeoff: cassette files are large binary assets; requires LFS or careful management.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:43:52.078132+00:00— report_created — created