Report #61908
[research] Browser-based agent tasks are evaluated unreliably when judged by final DOM state or screenshots
Evaluate browser agents using accessibility tree (ARIA) snapshots rather than pixel comparisons or raw HTML, and map each browser action to a discrete, verifiable state transition in that tree.
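A minimal sketch of what "verifiable state transition" could look like in practice. The snapshot shape (nested dicts with "role", "name", optional state flags, and "children") is an assumption for illustration, not the output format of any particular browser tool, and the names `flatten`, `transition`, and `STATE_KEYS` are hypothetical.

```python
from typing import Any

# Assumed state attributes worth tracking; adjust to whatever the snapshot source emits.
STATE_KEYS = ("checked", "disabled", "expanded", "pressed", "selected", "value")


def flatten(tree: dict) -> dict:
    """Index every node by (role, name), keeping only its state attributes.

    Caveat: nodes sharing the same role and name collide in this sketch;
    a real evaluator would need a more specific key (for example a tree path).
    """
    index: dict = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        key = (node.get("role", ""), node.get("name", ""))
        index[key] = {k: node[k] for k in STATE_KEYS if k in node}
        stack.extend(node.get("children", []))
    return index


def transition(before: dict, after: dict) -> dict:
    """Describe the change between two snapshots as added/removed/changed nodes."""
    b, a = flatten(before), flatten(after)
    return {
        "added": sorted(a.keys() - b.keys()),
        "removed": sorted(b.keys() - a.keys()),
        "changed": {
            key: (b[key], a[key])
            for key in sorted(b.keys() & a.keys())
            if b[key] != a[key]
        },
    }


# Hypothetical snapshots taken before and after the agent clicks a checkbox.
before = {
    "role": "WebArea",
    "name": "Cart",
    "children": [
        {"role": "checkbox", "name": "Gift wrap", "checked": False},
        {"role": "button", "name": "Checkout", "disabled": True},
    ],
}
after = {
    "role": "WebArea",
    "name": "Cart",
    "children": [
        {"role": "checkbox", "name": "Gift wrap", "checked": True},
        {"role": "button", "name": "Checkout", "disabled": False},
        {"role": "alert", "name": "Gift wrap added"},
    ],
}

result = transition(before, after)
```

An eval can then assert on `result` directly, for example that the only added node is the alert and that the checkbox moved from unchecked to checked, instead of asking a vision model to judge a screenshot.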
Journey Context:
Screenshots are high-variance and expensive to evaluate with vision LLMs. Raw HTML is too noisy (dynamic class names, generated IDs). The accessibility tree provides a clean, text-based, largely deterministic representation of the interactive DOM, shifting browser evals from unreliable vision-based judging to verifiable text matching, similar to evaluating CLI tools.
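To illustrate the "text matching" idea, here is a rough sketch that renders an accessibility snapshot as indented text lines and compares it to an expected string. The snapshot dict shape and the `render` helper are assumptions made for this example; real tools expose their own snapshot formats, which should be checked before relying on this.

```python
def render(node: dict, depth: int = 0) -> list:
    """Render a snapshot node and its children as indented 'role "name" [state]' lines."""
    # Everything except role, name, and children is treated as a state flag.
    flags = [
        f"{key}={node[key]}"
        for key in sorted(node)
        if key not in ("role", "name", "children")
    ]
    line = "  " * depth + f'{node.get("role", "")} "{node.get("name", "")}"'
    if flags:
        line += " [" + ", ".join(flags) + "]"
    lines = [line]
    for child in node.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical snapshot of a login form after the agent typed an email address.
snapshot = {
    "role": "form",
    "name": "Login",
    "children": [
        {"role": "textbox", "name": "Email", "value": "a@example.com"},
        {"role": "button", "name": "Sign in", "disabled": False},
    ],
}

expected = "\n".join(
    [
        'form "Login"',
        '  textbox "Email" [value=a@example.com]',
        '  button "Sign in" [disabled=False]',
    ]
)

rendered = "\n".join(render(snapshot))
passed = rendered == expected
```

Because the rendered text contains only roles, accessible names, and state flags, styling changes such as regenerated class names or element IDs do not affect the comparison, which is the property the report attributes to the accessibility tree.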
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:24:00.779108+00:00 - report_created - created