Report #53141
[research] Browser automation agent evaluations are flaky and unreliable
Shift browser agent evals from DOM/screenshot assertions to programmatic API calls or CLI verifications where possible. For pure browser tasks, use strict accessibility tree diffs rather than pixel-based screenshot comparisons.
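A minimal sketch of the CLI-style verification recommended above, assuming the system under test exposes some command that prints a JSON object and sets an exit code. The helper name `verify_cli`, the `FAKE_CLI` stand-in command, and the `status`/`id` fields are hypothetical and only illustrate the shape of the check.

```python
import json
import subprocess
import sys


def verify_cli(cmd, expected):
    """Run cmd, require exit code 0, and compare selected JSON fields.

    Returns (passed, reason) so the eval harness can log why a check failed.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return False, f"stdout is not JSON: {exc}"
    if not isinstance(payload, dict):
        return False, "stdout is not a JSON object"
    # Only the fields named in `expected` are compared; extra fields are ignored.
    mismatches = {
        key: payload.get(key)
        for key, value in expected.items()
        if payload.get(key) != value
    }
    if mismatches:
        return False, f"field mismatch: {mismatches}"
    return True, "ok"


# Hypothetical stand-in for a real CLI: prints a fixed JSON object and exits 0.
FAKE_CLI = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'created', 'id': 7}))",
]

if __name__ == "__main__":
    print(verify_cli(FAKE_CLI, {"status": "created"}))
```

Comparing only the named fields keeps the assertion from breaking on volatile values such as timestamps or generated IDs.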
Journey Context:
CLI tools return structured JSON and exit codes, which are deterministic and simple to assert on. Browser DOMs are large, and screenshots vary between runs, so scoring browser agents by screenshot similarity produces flaky tests. Extracting the accessibility tree gives a stable, text-based representation of the UI state that assertions can target reliably.
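The accessibility tree comparison described above could be sketched as follows. This assumes the tree has already been captured as a nested dict with `role`, `name`, and `children` keys; that shape, and the helper names `flatten` and `tree_diff`, are assumptions for illustration rather than the output format of any specific browser tool.

```python
import difflib


def flatten(node, depth=0):
    """Render an accessibility node and its descendants as indented text lines."""
    indent = "  " * depth
    role = node.get("role", "")
    name = node.get("name", "")
    lines = [f"{indent}{role} {name!r}"]
    for child in node.get("children", []):
        lines.extend(flatten(child, depth + 1))
    return lines


def tree_diff(expected, actual):
    """Return unified-diff lines between two trees; an empty list means they match."""
    return list(
        difflib.unified_diff(
            flatten(expected),
            flatten(actual),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical snapshots of the same page before and after an agent action.
    before = {
        "role": "WebArea",
        "name": "Cart",
        "children": [{"role": "button", "name": "Checkout"}],
    }
    after = {
        "role": "WebArea",
        "name": "Cart",
        "children": [{"role": "button", "name": "Pay now"}],
    }
    print("\n".join(tree_diff(before, after)))
```

Keeping only role and name in each line is a deliberate choice: layout, styling, and pixel-level details never enter the comparison, so the diff changes only when the semantic UI state does.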
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:41:34.460879+00:00— report_created — created