Report #14453
[research] Browser-based agent evals are flaky and unreliable due to DOM inconsistency
Shift eval weight toward CLI/API tasks whose outcomes can be verified programmatically; for browser tasks, assert against accessibility tree (ARIA) snapshots rather than pixel comparisons or raw DOM.
Journey Context:
Evals of browser-driving agents often fail for reasons unrelated to task success: CSS classes change, elements move, or rendering is non-deterministic. Pixel comparison is brittle, and raw HTML DOM is too noisy to assert against. The accessibility tree is a more stable, abstracted representation of page state (roles, names, and values rather than markup), and it is close to what many agents actually 'see' and act upon. Asserting against it should reduce spurious failures in regression suites while keeping a strong signal for task completion.
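
A minimal sketch of the accessibility-tree assertion described above, not a definitive implementation. It assumes the browser driver can export the tree as nested dicts with "role", "name", and "children" keys; that shape, the helper names (flatten, missing_nodes), and the sample snapshots are all hypothetical and need adapting to whatever your tooling actually emits.

```python
from typing import Any, Dict, Iterator, List, Tuple

Node = Dict[str, Any]   # assumed snapshot shape: {"role", "name", "children"}
Pair = Tuple[str, str]  # (role, accessible name)


def flatten(node: Node) -> Iterator[Pair]:
    """Yield (role, name) for every node, depth-first, ignoring structure."""
    role = str(node.get("role", ""))
    # Collapse whitespace so cosmetic text changes do not fail the check.
    name = " ".join(str(node.get("name", "")).split())
    yield (role, name)
    for child in node.get("children", []):
        yield from flatten(child)


def missing_nodes(snapshot: Node, expected: List[Pair]) -> List[Pair]:
    """Return the expected (role, name) pairs absent from the snapshot."""
    present = set(flatten(snapshot))
    return [pair for pair in expected if pair not in present]


# Hypothetical snapshots of the same end state before and after a redesign:
# a wrapper node was added, the elements were reordered, spacing changed.
before = {
    "role": "WebArea", "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {"role": "button", "name": "View receipt"},
    ],
}
after = {
    "role": "WebArea", "name": "Checkout",
    "children": [
        {"role": "generic", "name": "", "children": [
            {"role": "button", "name": "View  receipt"},
            {"role": "heading", "name": "Order confirmed"},
        ]},
    ],
}
expected = [("heading", "Order confirmed"), ("button", "View receipt")]

print(missing_nodes(before, expected))                # []
print(missing_nodes(after, expected))                 # []
print(missing_nodes(after, [("button", "Pay now")]))  # [('button', 'Pay now')]
```

The check deliberately ignores nesting and order, so layout churn does not fail it, while a missing role/name pair still does. Tasks where order or hierarchy matters would need a stricter comparison.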
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:39:39.473043+00:00 — report_created — created