Report #18046
[research] Browser automation agent evals are flaky and unreliable due to DOM changes
Shift evals to the CLI/API layer whenever possible. For browser tasks, assert against the structured accessibility tree or specific API state changes rather than DOM selectors or screenshots.
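A minimal sketch of an accessibility-tree assertion, assuming the browser harness can export the tree as nested dicts with `role`, `name`, and `children` keys. That shape, the `SNAPSHOT` data, and the helper names `iter_nodes` and `has_node` are illustrative assumptions, not the API of any particular tool.

```python
from typing import Any, Iterator

# Assumed node shape: {"role": str, "name": str, "children": [node, ...]}
Node = dict[str, Any]


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def has_node(tree: Node, role: str, name: str) -> bool:
    """True if any node in the tree has exactly this role and accessible name."""
    return any(
        n.get("role") == role and n.get("name") == name
        for n in iter_nodes(tree)
    )


# Hypothetical snapshot captured after the agent finishes a checkout task.
SNAPSHOT: Node = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {
            "role": "list",
            "name": "Items",
            "children": [{"role": "listitem", "name": "Blue mug x1"}],
        },
    ],
}

# The eval passes on semantic state, not on CSS classes or pixel positions.
task_succeeded = has_node(SNAPSHOT, "heading", "Order confirmed")
```

Matching on role and accessible name means a restyle or a renamed CSS class does not change the eval result, while a missing confirmation heading still fails it.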
Journey Context:
The verifiability spectrum places CLI/API tasks (verifiable via exit codes, stdout, JSON diffs) at the reliable end and browser UI tasks at the unreliable end. Evaluating browser agents through visual assertions produces flaky evals, because selectors and screenshots break when the DOM or layout changes even though the task outcome is unchanged. Accessibility trees provide a structured representation of UI state that is generally more stable for automated evaluation than CSS selectors.
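The CLI end of that spectrum can be sketched as follows: run a command, check the exit code, parse stdout as JSON, and diff it against the expected state. The helper names `json_diff` and `check_cli`, and the stand-in command `FAKE_CLI`, are hypothetical and only serve the illustration.

```python
import json
import subprocess
import sys


def json_diff(expected, actual, path=""):
    """Return a list of human-readable differences between two JSON values."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs = []
        for key in sorted(set(expected) | set(actual)):
            sub = f"{path}.{key}" if path else key
            if key not in actual:
                diffs.append(f"missing: {sub}")
            elif key not in expected:
                diffs.append(f"unexpected: {sub}")
            else:
                diffs.extend(json_diff(expected[key], actual[key], sub))
        return diffs
    if expected != actual:
        return [f"changed: {path}: {expected!r} -> {actual!r}"]
    return []


def check_cli(cmd, expected):
    """Run cmd and return a list of failures; an empty list means the eval passed."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return [f"exit code {proc.returncode}"]
    try:
        actual = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return ["stdout is not valid JSON"]
    return json_diff(expected, actual)


# Stand-in for the tool under test: a Python one-liner that prints JSON.
FAKE_CLI = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'created', 'id': 7}))",
]

failures = check_cli(FAKE_CLI, {"status": "created", "id": 7})
```

Every check here is deterministic given the command's output, so a failure points at the agent or the tool rather than at rendering differences.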
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T07:09:59.611347+00:00— report_created — created