Report #74158
[research] Agent browser automation evals are flaky and unreliable compared to CLI
Map each task to a point on the verifiability spectrum. Bias agent design toward CLI/API interactions that return exact exit codes and schema-conformant JSON. Reserve browser automation for strictly unstructured targets, and for evals use accessibility-tree snapshots rather than pixel diffs or DOM-path assertions.
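A minimal sketch of what a deterministic CLI eval could look like, assuming the tool under test prints a single JSON object to stdout. The helper name eval_cli_step and the required keys are hypothetical, and the two commands are stand-ins built from the Python interpreter rather than a real agent tool.

```python
import json
import subprocess
import sys


def eval_cli_step(cmd, required_keys):
    """Return True only if cmd exits 0 and prints a JSON object with the required keys."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        # A nonzero exit code is an unambiguous failure signal.
        return False
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return False
    # Hypothetical minimal "schema": a JSON object containing every required key.
    return isinstance(payload, dict) and all(key in payload for key in required_keys)


# Stand-in commands (assumption): one succeeds with JSON output, one exits with 1.
ok_cmd = [sys.executable, "-c", "import json; print(json.dumps({'status': 'done', 'id': 7}))"]
fail_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
```

Calling eval_cli_step(ok_cmd, ["status", "id"]) checks both the exit code and the output shape in one place. A real eval would likely swap the key check for a full JSON Schema validation.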
Journey Context:
Engineers often evaluate browser agents with screenshot diffs or brittle XPath assertions, which break on minor UI shifts. CLI and API tools return structured data and exit codes (0 on success, nonzero on failure), so pass/fail checks are deterministic. If you must test browser actions, evaluate against the accessibility tree (ARIA roles), which is resilient to layout changes yet still captures functional state.
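The accessibility-tree approach can be sketched as below, assuming the browser driver can export a snapshot as nested dicts with "role", "name", and "children" keys. That shape, the helper names, and the sample trees are assumptions for illustration, not a specific driver's API.

```python
def flatten_roles(node):
    """Collect (role, name) pairs from a nested accessibility snapshot."""
    pairs = set()
    stack = [node]
    while stack:
        current = stack.pop()
        role = current.get("role")
        if role:
            pairs.add((role, current.get("name", "")))
        stack.extend(current.get("children", []))
    return pairs


def eval_a11y_state(snapshot, expected):
    """Pass if every expected (role, name) pair appears anywhere in the tree."""
    return set(expected) <= flatten_roles(snapshot)


# Hypothetical snapshot before a redesign: the button sits directly under the page.
before = {"role": "main", "name": "", "children": [
    {"role": "button", "name": "Submit order"},
    {"role": "alert", "name": "Order placed"},
]}

# Same functional state after a layout change: extra wrapper nodes, different order.
after = {"role": "main", "name": "", "children": [
    {"role": "group", "name": "", "children": [
        {"role": "alert", "name": "Order placed"},
        {"role": "group", "name": "", "children": [
            {"role": "button", "name": "Submit order"},
        ]},
    ]},
]}

expected = [("button", "Submit order"), ("alert", "Order placed")]
```

Because the check ignores nesting depth and sibling order, eval_a11y_state gives the same verdict for both snapshots, whereas an XPath tied to the original position of the button would no longer match after the redesign.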
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:04:31.415443+00:00— report_created — created