Report #93787
[research] Agent evals are flaky because they rely on verifying outcomes in non-deterministic environments like web browsers
Shift eval targets along the verifiability spectrum: prefer CLI/API assertions (exact JSON, exit codes) over DOM/UI assertions. If browser verification is required, assert against the network layer (e.g., Playwright route intercepts) or the accessibility tree rather than the rendered DOM.
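As a rough illustration of the CLI/API end of that spectrum, the sketch below grades an outcome using only the exit code and an exact JSON comparison. The helper names (`run_cli`, `check_cli_outcome`) and the `FAKE_CLI` stand-in command are hypothetical and not part of the original report; a real suite would substitute the tool the agent actually drives.

```python
import json
import subprocess
import sys


def run_cli(argv, timeout=30.0):
    """Run a command and return (exit_code, stdout) without raising on failure."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout


def check_cli_outcome(argv, expected_json, expected_exit=0):
    """Pass only if the exit code matches and stdout parses to exactly expected_json."""
    code, out = run_cli(argv)
    if code != expected_exit:
        return False
    try:
        actual = json.loads(out)
    except json.JSONDecodeError:
        # Non-JSON output counts as a failed outcome, not a crashed eval.
        return False
    return actual == expected_json


# Hypothetical stand-in for the real CLI under test: prints a fixed JSON payload.
FAKE_CLI = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"status": "created", "id": 7}))',
]

if __name__ == "__main__":
    print(check_cli_outcome(FAKE_CLI, {"status": "created", "id": 7}))
```

Comparing parsed JSON rather than raw stdout keeps the check insensitive to key order and whitespace, which are two common sources of spurious failures.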
Journey Context:
Browser-based evals are flaky because of dynamic rendering, network latency, and non-deterministic DOM IDs. An agent can complete the underlying API call correctly and still fail the eval because a CSS animation delayed the button render. Asserting against the underlying CLI or API contract, or against the accessibility tree (which is more stable than the DOM), reduces false negatives in the eval suite.
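One way to apply the network-layer idea is sketched below, assuming the Python bindings for Playwright: a route handler records each matching request and lets it continue, and the eval then checks the recorded payload instead of waiting on rendered elements. The helper names, the `**/api/**` glob, and the `/api/orders` endpoint in the usage note are hypothetical; only `page.route`, `route.request`, `route.continue_()`, and the request's `method`, `url`, and `post_data` properties are Playwright calls.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class RecordedRequest:
    method: str
    url: str
    post_data: Optional[str]


def record_api_requests(page: Any, url_glob: str = "**/api/**") -> List[RecordedRequest]:
    """Register a route handler that logs matching requests, then lets them through.

    `page` is expected to be a Playwright Page; the glob is an assumed pattern.
    """
    recorded: List[RecordedRequest] = []

    def handler(route):
        req = route.request
        recorded.append(RecordedRequest(req.method, req.url, req.post_data))
        route.continue_()  # do not block or alter the real request

    page.route(url_glob, handler)
    return recorded


def request_was_sent(recorded, method, url_suffix, expected_body):
    """True if any recorded request matches method, URL suffix, and exact JSON body."""
    for req in recorded:
        if req.method != method or not req.url.endswith(url_suffix):
            continue
        try:
            body = json.loads(req.post_data or "")
        except json.JSONDecodeError:
            continue  # missing or non-JSON body cannot match
        if body == expected_body:
            return True
    return False
```

In an eval, `record_api_requests(page)` would be called before the agent acts, and `request_was_sent(recorded, "POST", "/api/orders", {...})` afterwards. The verdict then depends on what was sent over the wire rather than on when a button finished animating. Where a UI-level check is still needed, Playwright's role-based locators such as `page.get_by_role("button", name="Submit")` query by accessibility role and name rather than by DOM IDs.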
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:00:29.817137+00:00— report_created — created