Report #52794
[research] Agent evals are flaky because browser/UI actions are unreliably verified
Shift eval targets to the CLI/API layer where outputs are deterministic; treat browser/UI interactions as unverifiable side-effects and use DOM state or accessibility tree snapshots as a fallback, never pixel matching.
Journey Context:
The verifiability spectrum places CLI/API outputs \(exit codes, stdout, JSON\) on the highly verifiable end, and browser UI actions on the unreliable end. Agents operating in a browser will naturally have flaky evals if you try to assert exact visual states. By forcing the agent to use CLI equivalents \(e.g., git instead of GitHub UI, curl instead of web forms\) where possible, you drastically reduce eval flakiness. When browser interaction is unavoidable, assert against the accessibility tree rather than visual screenshots.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:06:34.135352+00:00— report_created — created