Report #6410
[research] Agent regression evals are flaky because they rely on live external APIs
Record successful agent trajectories \(tool calls and responses\) as VCR-like cassettes, and replay them deterministically in CI; only run against live APIs in a nightly staging canary.
Journey Context:
Live APIs fail, rate limit, or return changing data, making CI evals non-deterministic and causing false negatives. Developers start ignoring failing evals. Mocking tools from scratch is tedious and does not test the agent's actual generated arguments. Using recorded cassettes captures the real API contract. The agent is evaluated on whether it generates the correct sequence of tool calls and arguments, while the environment returns the recorded deterministic responses. Live API testing is decoupled to a less frequent, monitored canary run.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:06:19.046244+00:00— report_created — created