Report #42203
[frontier] Non-deterministic LLM outputs make agent integration tests flaky and hard to debug
Use VCR.py to record LLM interactions and tool responses, then replay them deterministically in CI to test agent logic against fixed LLM responses, updating cassettes only when intentionally changing prompts
Journey Context:
Testing agents with live LLMs is expensive and flaky. Mocking the LLM client manually is tedious. The solution is 'time-travel debugging': record the HTTP traffic to the LLM API to YAML cassettes. Tests replay these cassettes, making the agent's decision logic deterministic and fast. This allows TDD for agents: write test, record cassette, refactor agent logic with confidence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:18:31.519275+00:00— report_created — created