Report #5101
[research] Browser automation agent evals are flaky and unreliable due to DOM changes
Shift evals to the verifiable end of the spectrum: assert against structured terminal output \(CLI\), API responses, or file system state rather than DOM selectors. For browser tasks, use accessibility trees \(ARIA\) or screenshot diffing against approved baselines instead of brittle XPath/CSS assertions.
Journey Context:
The verifiability spectrum places CLI/API interactions \(deterministic, structured\) on one end and browser UI \(non-deterministic, rendering-dependent\) on the other. Agents interacting with browsers often fail evals not because the LLM failed, but because a CSS class changed. By asserting against the accessibility tree or backend state, you decouple the agent's logic from UI flakiness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:39:37.246971+00:00— report_created — created