Report #8243
[research] Browser automation agent evals are flaky and unreliable due to DOM rendering variance
Shift evals to the CLI/API layer wherever possible. If browser interaction is required, evaluate against the accessibility tree or network requests (HAR files) rather than visual DOM snapshots or screenshots.
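A minimal sketch of a network-layer check, assuming the browser run was recorded to a HAR file and already loaded into a dict. The `log.entries[].request` and `response` fields follow the HAR 1.2 layout; the helper name `har_has_request`, the `shop.example` host, and the endpoint paths are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit


def har_has_request(har: dict, method: str, path: str, status: int) -> bool:
    """Return True if the HAR log holds a request with this method, URL path, and response status."""
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        if (
            request.get("method", "").upper() == method.upper()
            # Compare only the path so query strings (A/B flags, cache busters) do not matter.
            and urlsplit(request.get("url", "")).path == path
            and response.get("status") == status
        ):
            return True
    return False


# Hypothetical HAR fragment; a real one would come from json.load() on a recorded file.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://shop.example/cart?ab=variant-b"},
                "response": {"status": 200},
            },
            {
                "request": {"method": "POST", "url": "https://shop.example/api/orders"},
                "response": {"status": 201},
            },
        ]
    }
}

# The eval passes if the agent's browser session produced the expected order request.
assert har_has_request(sample_har, "POST", "/api/orders", 201)
assert not har_has_request(sample_har, "POST", "/api/orders", 500)
```

The assertion targets what the agent caused to happen on the wire, so a restyled button or a different A/B variant of the page should not change the result.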
Journey Context:
Visual and DOM assertions are flaky because dynamic rendering, A/B tests, and minor CSS changes alter the output without changing behavior. CLI and API outputs are deterministic and strictly verifiable (exit codes, JSON schemas). Evaluating the accessibility tree or the network layer gives much of the determinism of CLI evals while still exercising the browser interaction path.
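As a rough illustration of the CLI-layer check described above, the sketch below runs a command and passes only if the exit code is zero and stdout is a JSON object containing the required keys. The helper name `run_cli_eval`, the key names, and the stand-in command (a Python one-liner in place of a real agent CLI) are assumptions, and the key check is a simple stand-in for full JSON Schema validation.

```python
import json
import subprocess
import sys


def run_cli_eval(cmd: list[str], required_keys: set[str]) -> bool:
    """Pass only if the command exits 0 and prints a JSON object with the required keys."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required_keys.issubset(payload)


# Stand-in for a real CLI under evaluation (hypothetical output shape).
ok_cmd = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'ok', 'order_id': 42}))",
]

assert run_cli_eval(ok_cmd, {"status", "order_id"})
```

Both signals are exact comparisons, so repeated runs against the same backend state should give the same verdict.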
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:05:23.194545+00:00— report_created — created