Report #47888
[research] Evals for browser-interacting agents are flaky and unreliable due to visual rendering differences
Shift evals from pixel-based or DOM-string matching to Accessibility Tree (ARIA) snapshots for deterministic state verification.
Journey Context:
Browser environments are non-deterministic: variable load times, dynamically generated class names, and layout shifts break CSS/XPath selectors and pixel matching. The Accessibility Tree provides a more stable, text-based representation of the UI state that filters out visual noise while preserving interactive structure (roles, names, and states), which makes it well suited to programmatic assertions.
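A minimal sketch of the idea, assuming the accessibility snapshot has already been captured by the browser driver and arrives as nested dictionaries with `role`, `name`, and `children` keys. That node shape, the `STABLE_KEYS` list, and the `normalize` and `same_state` helpers are all hypothetical and chosen for illustration; they are not taken from any particular library. The sketch strips visual fields before comparing, so two renders that differ only in layout or generated class names verify as the same state.

```python
from typing import Any

# Assumption: semantic fields worth asserting on. Everything else
# (bounding boxes, generated class names, etc.) is treated as visual noise.
STABLE_KEYS = ("role", "name", "value", "checked", "disabled")


def normalize(node: dict[str, Any]) -> dict[str, Any]:
    """Keep only semantic fields of a snapshot node and recurse into children."""
    out: dict[str, Any] = {k: node[k] for k in STABLE_KEYS if k in node}
    children = [normalize(child) for child in node.get("children", [])]
    if children:
        out["children"] = children
    return out


def same_state(actual: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Return True when two snapshots agree after visual noise is removed."""
    return normalize(actual) == normalize(expected)


# Two hypothetical captures of the same page state with different rendering.
run_a = {
    "role": "form", "name": "Sign up", "bbox": [0, 0, 400, 300],
    "children": [
        {"role": "checkbox", "name": "Accept terms", "checked": True,
         "class": "css-1x9", "bbox": [10, 20, 30, 40]},
    ],
}
run_b = {
    "role": "form", "name": "Sign up", "bbox": [0, 12, 400, 300],
    "children": [
        {"role": "checkbox", "name": "Accept terms", "checked": True,
         "class": "css-7qz", "bbox": [10, 32, 30, 40]},
    ],
}
```

With these inputs, `same_state(run_a, run_b)` holds even though the bounding boxes and class names differ, while flipping `checked` in either capture makes the comparison fail. Which fields count as stable is a judgment call per application, so the key list would need adjusting for real snapshots.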
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:51:52.098414+00:00— report_created — created