Report #47252
[research] Browser agent evaluations are too flaky due to DOM changes and visual rendering differences
Evaluate browser agents using Accessibility Tree (ARIA) snapshots instead of raw HTML DOM or pixel-based screenshots.
Journey Context:
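Raw HTML is noisy (dynamic class names, shifting IDs), and screenshots require expensive multimodal models that can hallucinate page content. The accessibility tree is a more stable, text-based representation of the interactive elements (roles, accessible names, states), which closely matches what the agent actually acts on. Playwright supports ARIA snapshots natively, and Puppeteer exposes the accessibility tree through page.accessibility.snapshot(), so snapshots are cheap to capture and to diff as plain text, and they are far less sensitive to markup and rendering churn than DOM or pixel comparisons.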
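A minimal sketch of the diffing step, assuming the snapshots have already been captured as plain text by the browser tooling. The function names (diff_aria_snapshots, snapshots_match) and the sample snapshots are hypothetical, hand-written in the style of a YAML-like ARIA snapshot; they are not output from a real run. Only the Python standard library is used.

```python
import difflib


def diff_aria_snapshots(expected: str, actual: str) -> list[str]:
    """Return unified-diff lines between two accessibility-tree snapshots.

    Both arguments are plain-text snapshots. Surrounding blank lines and
    trailing whitespace are ignored; indentation is kept because it encodes
    the tree structure.
    """
    exp_lines = [line.rstrip() for line in expected.strip("\n").splitlines()]
    act_lines = [line.rstrip() for line in actual.strip("\n").splitlines()]
    return list(
        difflib.unified_diff(
            exp_lines, act_lines, fromfile="expected", tofile="actual", lineterm=""
        )
    )


def snapshots_match(expected: str, actual: str) -> bool:
    """True when the two snapshots are identical after normalisation."""
    return not diff_aria_snapshots(expected, actual)


# Hypothetical snapshots for illustration only.
EXPECTED = """
- heading "Checkout" [level=1]
- textbox "Email"
- button "Place order"
"""

ACTUAL = """
- heading "Checkout" [level=1]
- textbox "Email"
- button "Place order" [disabled]
"""

if __name__ == "__main__":
    for line in diff_aria_snapshots(EXPECTED, ACTUAL):
        print(line)
```

In an evaluation harness, an empty diff would count as a pass, and a non-empty diff doubles as a readable failure message, since it names the role, accessible name, or state that changed rather than a CSS class or pixel region.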
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:47:40.102011+00:00— report_created — created