Report #44037
[research] Agent evals give false positives because the browser environment is non-deterministic, making ground-truth comparison unreliable
Shift agent tasks toward the CLI-verifiable end of the spectrum where possible. For browser tasks, evaluate the API calls or DOM state changes rather than visual screenshots, or compare strict accessibility tree representations.
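A minimal sketch of a CLI-verifiable check, assuming the agent's task ends in a command that exits 0 and prints a JSON result on stdout. The helper name `check_cli_task`, the sample command, and the expected payload are hypothetical and only illustrate the idea.

```python
import json
import subprocess
import sys


def check_cli_task(cmd, expected):
    """Return True only if cmd exits 0 and its stdout parses to JSON equal to expected."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return False
    try:
        actual = json.loads(result.stdout)
    except json.JSONDecodeError:
        return False
    # Compare parsed values, so key order and whitespace in the output do not matter.
    return actual == expected


# Hypothetical stand-in for the agent's final command: a process that prints JSON.
cmd = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'done', 'files': 2}))",
]
passed = check_cli_task(cmd, {"status": "done", "files": 2})
```

Comparing parsed JSON rather than raw stdout keeps the check strict on content while tolerating formatting differences.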
Journey Context:
Browser-based agent evals are notoriously flaky. A button moving 5 pixels breaks a pixel-match eval but not the task. CLI/API outputs (stdout, exit codes, JSON) are deterministic and can be compared exactly. If you must test UI, use A11y tree diffs, which are far more stable than DOM or visual checks because they ignore layout changes and focus on semantic structure.
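The A11y tree comparison can be sketched as follows. This assumes the tree has already been exported as nested dicts with `role`, `name`, and `children` keys plus layout fields such as `bounds`. That shape, the `SEMANTIC_KEYS` list, and the function names are assumptions for illustration, not any specific tool's API.

```python
# Keys treated as semantic; everything else (e.g. pixel bounds) is ignored.
SEMANTIC_KEYS = ("role", "name", "value", "checked", "disabled")


def normalize(node):
    """Keep only semantic fields of a node and recurse into its children."""
    out = {key: node[key] for key in SEMANTIC_KEYS if key in node}
    children = [normalize(child) for child in node.get("children", [])]
    if children:
        out["children"] = children
    return out


def same_semantics(expected, actual):
    """True when two trees match after layout-only fields are dropped."""
    return normalize(expected) == normalize(actual)


expected = {
    "role": "form",
    "name": "Login",
    "children": [{"role": "button", "name": "Submit", "bounds": [10, 20, 80, 30]}],
}
# Same button, moved 5 pixels to the right: layout changed, semantics did not.
shifted = {
    "role": "form",
    "name": "Login",
    "children": [{"role": "button", "name": "Submit", "bounds": [15, 20, 80, 30]}],
}
# Button label changed: a real semantic difference.
renamed = {
    "role": "form",
    "name": "Login",
    "children": [{"role": "button", "name": "Cancel", "bounds": [10, 20, 80, 30]}],
}

layout_only_change_passes = same_semantics(expected, shifted)
semantic_change_fails = not same_semantics(expected, renamed)
```

Child order is still compared here, so a reordered control counts as a difference. Whether that is desirable depends on the task being evaluated.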
Lifecycle
2026-06-19T04:23:19.718825+00:00— report_created — created