Report #54260
[research] Browser automation agent evals are flaky and unreliable due to DOM changes
Shift evals toward CLI- or API-verifiable endpoints wherever possible. For unavoidable browser tasks, evaluate against the accessibility tree rather than the rendered DOM (CSS selectors) or screenshot diffs.
Journey Context:
Screenshot-comparison and CSS-selector-based evals for browser agents break frequently on minor UI updates (A/B tests, dynamically generated class names). On the verifiability spectrum, CLIs and APIs (structured JSON output) sit at the highly verifiable end and the visual DOM at the unreliable end. Evaluating the accessibility tree, which represents the page's semantic structure, brings determinism closer to that of CLI evals while still testing the browser interaction layer.
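A minimal sketch of an accessibility-tree check, under stated assumptions: the snapshot is assumed to be a nested dict with `role`, `name`, and `children` keys (roughly the shape several browser automation tools can export, but not tied to any specific one), and the helper names `iter_nodes`, `semantic_view`, and `eval_passes` are hypothetical. The check asserts on (role, name) pairs only, so wrapper elements added by a UI update do not change the result.

```python
from typing import Any, Iterator

# Assumed snapshot shape: {"role": str, "name": str, "children": [...]}
Node = dict[str, Any]


def iter_nodes(tree: Node) -> Iterator[Node]:
    """Yield every node in the tree, depth first."""
    yield tree
    for child in tree.get("children", []):
        yield from iter_nodes(child)


def semantic_view(tree: Node) -> list[tuple[str, str]]:
    """Reduce a tree to (role, name) pairs, skipping presentation-only nodes."""
    ignored_roles = {"generic", "none", "presentation"}
    return [
        (node.get("role", ""), (node.get("name") or "").strip())
        for node in iter_nodes(tree)
        if node.get("role") not in ignored_roles
    ]


def eval_passes(tree: Node, expected: list[tuple[str, str]]) -> bool:
    """Pass when every expected (role, name) pair appears somewhere in the tree."""
    present = set(semantic_view(tree))
    return all(pair in present for pair in expected)


# Two hypothetical snapshots of the same page: variant_b simulates a UI update
# that wraps the content in an extra container with no semantic role.
variant_a = {"role": "WebArea", "name": "Checkout", "children": [
    {"role": "heading", "name": "Order confirmed"},
    {"role": "button", "name": "View receipt"},
]}
variant_b = {"role": "WebArea", "name": "Checkout", "children": [
    {"role": "generic", "name": "", "children": [
        {"role": "heading", "name": "Order confirmed"},
        {"role": "button", "name": "View receipt"},
    ]},
]}

expected = [("heading", "Order confirmed"), ("button", "View receipt")]
print(eval_passes(variant_a, expected), eval_passes(variant_b, expected))
# prints: True True
```

Matching on presence rather than exact tree equality is a deliberate trade-off in this sketch: it tolerates layout churn, at the cost of not detecting unexpected extra elements. A stricter eval could compare the full `semantic_view` output instead.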
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:34:16.571564+00:00 - report_created - created