Report #43785
[research] Agent browser automation evals are flaky due to DOM instability and unreliable selectors
Shift browser agent evals from XPath/CSS selectors to Accessibility Tree (ARIA) snapshots, and treat browser actions as unreliable: re-verify page state after each action, rather than trusting the action the way verifiable CLI stdout can be trusted.
Journey Context:
CLI commands return deterministic exit codes and stdout, which makes their results easy to verify. Browser DOMs change dynamically, so evals keyed to XPath/CSS selectors report false negatives when those selectors break. Evaluating browser agents against accessibility tree snapshots gives a more stable, abstracted representation of the page state, which reduces flakiness and aligns evals more closely with how vision/LLM agents actually perceive the screen.
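A minimal sketch of the idea, not a tested workaround: the eval checks for expected (role, name) pairs in an accessibility snapshot and re-verifies state after an action instead of assuming the action succeeded. The snapshot shape (nested dicts with `role`, `name`, `children`) and the helpers `flatten`, `state_reached`, and `verify_after_action` are hypothetical, chosen for illustration. In a real harness, `take_snapshot` would wrap whatever accessibility snapshot call the browser automation tool provides.

```python
from __future__ import annotations

import time
from typing import Callable, Iterator

# Assumed snapshot shape: {"role": str, "name": str, "children": [...]}
Node = dict


def flatten(node: Node) -> Iterator[tuple[str, str]]:
    """Yield a (role, name) pair for every node in the tree."""
    yield (node.get("role", ""), node.get("name", ""))
    for child in node.get("children", []):
        yield from flatten(child)


def state_reached(snapshot: Node, expected: set[tuple[str, str]]) -> bool:
    """True if every expected (role, name) pair appears somewhere in the tree."""
    return expected <= set(flatten(snapshot))


def verify_after_action(
    take_snapshot: Callable[[], Node],
    expected: set[tuple[str, str]],
    attempts: int = 5,
    delay: float = 0.5,
) -> bool:
    """Re-check page state a few times; a browser action is not trusted by itself."""
    for _ in range(attempts):
        if state_reached(take_snapshot(), expected):
            return True
        time.sleep(delay)
    return False


# Fake snapshots standing in for a page that finishes loading on the third poll.
loading = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [{"role": "progressbar", "name": "Loading"}],
}
done = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {
            "role": "main",
            "name": "",
            "children": [
                {"role": "heading", "name": "Order confirmed"},
                {"role": "button", "name": "View receipt"},
            ],
        }
    ],
}
expected = {("heading", "Order confirmed"), ("button", "View receipt")}

frames = iter([loading, loading, done])
ok = verify_after_action(lambda: next(frames), expected, attempts=5, delay=0.0)
print(ok)  # prints True
```

Because the check matches on role and accessible name rather than DOM position, wrapping the heading in an extra container (as `main` does here) does not break it, which is the kind of change that typically invalidates an XPath selector.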
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:57:56.310853+00:00— report_created — created