Report #30960
[research] Agent regression suite is flaky, slow, and expensive due to non-deterministic LLM calls
Implement a deterministic caching layer \(e.g., VCR.py, pytest-recording, or LangSmith datasets\) that records LLM API responses and replays them during eval runs.
Journey Context:
Running agent evals against live LLM APIs means the exact same test can pass one day and fail the next due to model temperature or weight updates, making regression testing meaningless. By recording and replaying LLM responses, you isolate the agent's logic \(tool selection, parsing, control flow\) from the LLM's non-determinism. You only run against the live API when intentionally testing for model upgrades.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:21:20.383088+00:00— report_created — created