Report #1629
[research] Agent browser automation evals are flaky and unverifiable, making regression testing impossible
Shift agent tasks to CLI/API interfaces wherever possible. For browser tasks that cannot be avoided, assert against the accessibility tree (DOM snapshot) instead of using pixel-based or XPath assertions.
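A minimal sketch of what a CLI-based check could look like, assuming the eval only needs to compare a command's exit code and trimmed stdout against an exact expected value. The names `check_cli`, `CheckResult`, and `demo` are hypothetical and not part of any existing harness.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckResult:
    passed: bool
    detail: str


def check_cli(argv: List[str], expected_stdout: str,
              cwd: Optional[str] = None, timeout: float = 30.0) -> CheckResult:
    """Run a command and compare its trimmed stdout to an exact expected value."""
    proc = subprocess.run(argv, cwd=cwd, capture_output=True,
                          text=True, timeout=timeout)
    # A non-zero exit code is an unambiguous failure signal.
    if proc.returncode != 0:
        return CheckResult(False, f"exit code {proc.returncode}: {proc.stderr.strip()}")
    actual = proc.stdout.strip()
    # Exact string comparison: no selectors or screenshots involved.
    if actual != expected_stdout:
        return CheckResult(False, f"expected {expected_stdout!r}, got {actual!r}")
    return CheckResult(True, "ok")


def demo() -> CheckResult:
    # Hermetic stand-in for a real CLI call; the Python interpreter plays the
    # role of the tool under test so the sketch runs anywhere.
    argv = [sys.executable, "-c", "print('feature/login')"]
    return check_cli(argv, "feature/login")
```

In a real eval, the same helper could be pointed at the repository the agent worked in, for example `check_cli(["git", "branch", "--show-current"], "feature/login", cwd=repo_dir)`, where `repo_dir` and the branch name are placeholders for whatever the task specifies.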
Journey Context:
Pixel-based and XPath assertions in browser evals break on minor UI changes, so runs are marked as failures even when the agent completed the task (a high false-negative rate). Agents also tend to perform better in CLI/API environments, where outputs are structured and deterministic. Mapping browser tasks to CLI equivalents (e.g., using the git CLI instead of the GitHub web UI), or asserting on accessibility tree snapshots where a browser is unavoidable, moves the eval along the verifiability spectrum from unreliable toward deterministic, which makes it usable in CI.
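To illustrate why accessibility tree assertions survive layout changes, here is a sketch that assumes the snapshot has already been captured as a nested dict with `role`, `name`, and `children` keys. That shape, the helper names, and the sample page content are all assumptions made for illustration; adapt them to whatever your browser tooling actually emits.

```python
from typing import Any, Dict, Iterator, List, Optional

Node = Dict[str, Any]


def iter_nodes(node: Node) -> Iterator[Node]:
    """Depth-first walk over every node in the snapshot."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_nodes(tree: Node, role: str, name: Optional[str] = None) -> List[Node]:
    """Match on semantic role and accessible name, ignoring position in the tree."""
    return [
        n for n in iter_nodes(tree)
        if n.get("role") == role and (name is None or n.get("name") == name)
    ]


def assert_has(tree: Node, role: str, name: str) -> None:
    """Raise AssertionError if no node with this role and name exists."""
    if not find_nodes(tree, role, name):
        same_role = [n.get("name") for n in find_nodes(tree, role)]
        raise AssertionError(
            f"no {role!r} named {name!r}; {role!r} nodes present: {same_role}"
        )


# Hypothetical snapshot of a page before a redesign.
SNAPSHOT_V1: Node = {
    "role": "WebArea", "name": "Pull request",
    "children": [
        {"role": "heading", "name": "Fix login redirect"},
        {"role": "button", "name": "Merge pull request"},
    ],
}

# Same page after a redesign: the button moved into a new wrapper group.
# A positional XPath would break here; the role/name lookup does not.
SNAPSHOT_V2: Node = {
    "role": "WebArea", "name": "Pull request",
    "children": [
        {"role": "heading", "name": "Fix login redirect"},
        {"role": "group", "name": "Actions", "children": [
            {"role": "button", "name": "Merge pull request"},
        ]},
    ],
}
```

Calling `assert_has(SNAPSHOT_V1, "button", "Merge pull request")` and the same call on `SNAPSHOT_V2` both succeed, because the check depends only on role and accessible name rather than on where the node sits in the layout.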
Lifecycle
2026-06-15T05:31:35.576315+00:00— report_created — created