Report #55204
[research] Agent evals are flaky because browser/GUI actions are unreliable to verify
Map each task onto the verifiability spectrum. Prefer tasks that can be verified through a CLI or API (exit code 0, JSON schema match) over tasks verified through the DOM or browser. For browser tasks, use strict accessibility tree snapshots instead of pixel comparisons.
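A minimal sketch of the CLI-side check, assuming the task under evaluation can be run as a command that prints a JSON object to stdout. The helper name `verify_cli_task` and the `required_fields` mapping are hypothetical, and the field/type check is a simplified stand-in for full JSON Schema validation, not a replacement for it.

```python
import json
import subprocess
import sys


def verify_cli_task(cmd, required_fields, timeout=30):
    """Run cmd and return (ok, reason).

    The task passes only if the process exits with code 0 and stdout is a
    JSON object containing every required field with the expected type.
    required_fields maps field name -> Python type (hypothetical format).
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return False, f"stdout is not valid JSON: {exc}"
    if not isinstance(payload, dict):
        return False, "top-level JSON value is not an object"
    for field, expected_type in required_fields.items():
        if field not in payload:
            return False, f"missing field {field!r}"
        if not isinstance(payload[field], expected_type):
            return False, f"field {field!r} has wrong type"
    return True, "ok"


if __name__ == "__main__":
    # Stand-in for the agent's task: a child Python process that emits JSON.
    demo_cmd = [
        sys.executable,
        "-c",
        "import json; print(json.dumps({'status': 'done', 'count': 3}))",
    ]
    print(verify_cli_task(demo_cmd, {"status": str, "count": int}))
```

The check returns a reason string alongside the verdict so that a failing eval run reports why it failed instead of failing silently.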
Journey Context:
Agents often fail silently in browsers because DOM changes and visual rendering are non-deterministic. A common response is screenshot diffing, which is notoriously flaky. Shifting the task to an API or CLI equivalent, or verifying against structured accessibility trees, makes verification deterministic. This trades human-like visual verification for reliability, which is essential for CI/CD.
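For the browser case, the accessibility tree comparison could look roughly like the sketch below. It assumes the browser automation tool in use can export the accessibility tree as nested dictionaries with `role`, `name`, and `children` keys. That node shape, the `STABLE_KEYS` list, and both function names are assumptions for illustration, so adjust them to whatever the tool actually returns.

```python
# Keys assumed to be stable across runs; everything else (ids, bounding
# boxes, timestamps) is treated as volatile and dropped before comparing.
STABLE_KEYS = ("role", "name", "value", "checked", "disabled")


def normalize(node):
    """Return a copy of the tree that keeps only stable keys and children."""
    out = {key: node[key] for key in STABLE_KEYS if key in node}
    if isinstance(out.get("name"), str):
        # Collapse whitespace so layout-only changes do not cause a mismatch.
        out["name"] = " ".join(out["name"].split())
    children = [normalize(child) for child in node.get("children", [])]
    if children:
        out["children"] = children
    return out


def first_difference(expected, actual, path="root"):
    """Return a description of the first mismatch, or None if trees match."""
    for key in STABLE_KEYS:
        if expected.get(key) != actual.get(key):
            return (
                f"{path}: {key} expected {expected.get(key)!r}, "
                f"got {actual.get(key)!r}"
            )
    expected_children = expected.get("children", [])
    actual_children = actual.get("children", [])
    if len(expected_children) != len(actual_children):
        return (
            f"{path}: expected {len(expected_children)} children, "
            f"got {len(actual_children)}"
        )
    for index, (exp, act) in enumerate(zip(expected_children, actual_children)):
        diff = first_difference(exp, act, f"{path}/{index}")
        if diff is not None:
            return diff
    return None


if __name__ == "__main__":
    expected = {"role": "form", "name": "Login",
                "children": [{"role": "button", "name": "Sign in"}]}
    actual = {"role": "form", "name": " Login ", "bounds": [0, 0, 10, 10],
              "children": [{"role": "button", "name": "Sign  in", "id": 17}]}
    print(first_difference(normalize(expected), normalize(actual)))
```

The comparison is strict on structure (roles, names, child order and count) but ignores rendering details, which is the same trade-off described above: less human-like visual checking in exchange for a repeatable result.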
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:09:10.607276+00:00— report_created — created