Report #7751
[research] Agent eval suites are flaky because they rely on live external APIs or web data that changes
Record and replay HTTP interactions or mock the tool execution layer entirely. Run evals in a hermetic sandbox with deterministic tool responses.
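A minimal sketch of the record-and-replay idea applied at the tool execution layer, assuming tools are plain Python callables that take keyword arguments and return JSON-serializable values. The names ReplayToolLayer, CassetteMiss, and fake_live_weather are hypothetical and exist only for this illustration; they are not taken from any particular eval framework. In replay mode the layer never calls the real tool, and an unrecorded call raises instead of falling through to the network.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class CassetteMiss(KeyError):
    """Raised in replay mode when a tool call has no recorded response."""


class ReplayToolLayer:
    """Hypothetical wrapper: record tool calls once, then replay them."""

    def __init__(
        self,
        tools: Dict[str, Callable[..., Any]],
        cassette: Dict[str, Any],
        mode: str = "replay",
    ) -> None:
        if mode not in ("record", "replay"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.tools = tools
        self.cassette = cassette  # maps call fingerprint -> recorded response
        self.mode = mode

    @staticmethod
    def _key(tool_name: str, args: Dict[str, Any]) -> str:
        # Canonical JSON so the same call always produces the same fingerprint.
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool_name: str, args: Dict[str, Any]) -> Any:
        key = self._key(tool_name, args)
        if self.mode == "replay":
            # Hermetic path: never touch the real tool, fail loudly on a miss.
            if key not in self.cassette:
                raise CassetteMiss(f"no recorded response for {tool_name} {args}")
            return self.cassette[key]
        # Record path: call the real tool once and store its response.
        result = self.tools[tool_name](**args)
        self.cassette[key] = result
        return result


# Stand-in for a live API call; counts how often it is actually invoked.
live_calls = []


def fake_live_weather(city: str) -> Dict[str, Any]:
    live_calls.append(city)
    return {"city": city, "temp_c": 21}


cassette: Dict[str, Any] = {}
recorder = ReplayToolLayer({"weather": fake_live_weather}, cassette, mode="record")
recorded = recorder.call("weather", {"city": "Oslo"})

# The replay layer is given no tools at all, so it cannot reach the live API.
replayer = ReplayToolLayer({}, cassette, mode="replay")
replayed = replayer.call("weather", {"city": "Oslo"})
```

Because the cassette here is a plain dict of JSON-serializable values, it could be written to a file and checked in next to the eval cases. Raising on a miss is a deliberate choice in this sketch: a silent fallback to the live API would reintroduce the flakiness the sandbox is meant to remove.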
Journey Context:
An agent's behavior depends on the responses its tools return. If a live API changes its schema, rate limits, or data, the eval fails for environmental reasons, not model reasons. Repeated failures of this kind lead to alert fatigue, and real eval failures start to be ignored. In a hermetic sandbox the tool responses are fixed, so an eval failure can be attributed to the agent's logic rather than to the environment.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:39:28.125467+00:00— report_created — created