Report #22748
[research] Browser-based agent evals are flaky and unreliable compared to CLI agents
Map agent tasks to the verifiability spectrum. Restrict high-stakes automated evals to CLI/API verifiable tasks (exact match, exit codes). For browser tasks, use DOM state assertions or accessibility tree snapshots instead of visual screenshot comparisons.
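A minimal sketch of the CLI end of that spectrum, assuming the agent's task is to produce a command whose stdout is JSON. The names `verify_cli_task`, `CliCheckResult`, and `demo_cmd` are hypothetical and not part of any existing harness; the demo command is only a stand-in for whatever the agent actually runs.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CliCheckResult:
    passed: bool
    reason: str


def verify_cli_task(cmd, expected_json, expected_exit_code=0, timeout=30):
    """Run a command and verify it by exit code plus exact JSON match."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    # Check 1: the exit code must match exactly.
    if proc.returncode != expected_exit_code:
        return CliCheckResult(
            False, f"exit code {proc.returncode} != {expected_exit_code}"
        )
    # Check 2: stdout must parse as JSON.
    try:
        actual = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return CliCheckResult(False, f"stdout is not valid JSON: {exc}")
    # Check 3: the parsed value must equal the expected value exactly.
    if actual != expected_json:
        return CliCheckResult(False, f"output mismatch: {actual!r}")
    return CliCheckResult(True, "exit code and output match")


# Hypothetical stand-in for an agent-produced command.
demo_cmd = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'ok', 'count': 3}))",
]
result = verify_cli_task(demo_cmd, {"status": "ok", "count": 3})
print(result)
```

Comparing parsed JSON rather than raw stdout keeps the check exact on content while ignoring key order and whitespace, which is usually what "exact match" is meant to cover for structured output.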
Journey Context:
CLI and API agents return structured, deterministic outputs (JSON, exit codes) that can be checked by exact match. Browser agents interact with non-deterministic DOMs and visual layouts. Evaluating browser agents via screenshot comparison or pixel matching is flaky because rendering details unrelated to task success (fonts, antialiasing, animation timing, viewport size) change the pixels from run to run. Shifting evals to the accessibility tree (ARIA) or the text of specific DOM nodes provides a stable, verifiable intermediate representation.
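A rough sketch of asserting on a role/name snapshot instead of pixels. It assumes the harness can hand over the page's serialized HTML; a real eval would more likely read the accessibility tree from the browser automation tool itself. `RoleSnapshot`, `snapshot`, and the `IMPLICIT_ROLES` table are hypothetical and deliberately incomplete, and the name computation here (aria-label, else text content) is a simplification of the actual accessible-name rules.

```python
from html.parser import HTMLParser

# Tiny, incomplete tag-to-role mapping (an assumption for illustration).
IMPLICIT_ROLES = {"button": "button", "a": "link", "h1": "heading", "h2": "heading"}
# Void elements never get an end tag, so they are not pushed on the stack.
VOID_TAGS = {"img", "input", "br", "hr", "meta", "link"}


class RoleSnapshot(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nodes = []   # role-bearing nodes, in document order
        self._stack = []  # open elements as (tag, node_or_None)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        role = attr.get("role") or IMPLICIT_ROLES.get(tag)
        node = None
        if role:
            node = {"role": role, "name": attr.get("aria-label"), "text": []}
            self.nodes.append(node)
        self._stack.append((tag, node))

    def handle_data(self, data):
        # Text contributes to every open role-bearing ancestor.
        for _, node in self._stack:
            if node is not None:
                node["text"].append(data)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def snapshot(html):
    """Return [(role, name), ...], ignoring classes, styles, and wrappers."""
    parser = RoleSnapshot()
    parser.feed(html)
    parser.close()
    return [
        (n["role"], n["name"] or " ".join("".join(n["text"]).split()))
        for n in parser.nodes
    ]


# Two hypothetical renders of the same end state with different markup.
run_a = (
    '<div class="css-1x2y"><h1>Checkout</h1>'
    '<button class="btn-primary">Place order</button>'
    '<p role="status">Order confirmed</p></div>'
)
run_b = (
    "<main><section><h1>  Checkout </h1></section>"
    '<button style="margin:4px" aria-label="Place order"><svg></svg></button>'
    '<p role="status">Order\n confirmed</p></main>'
)

assert snapshot(run_a) == snapshot(run_b)
assert ("status", "Order confirmed") in snapshot(run_b)
```

Both renders would differ under pixel comparison, yet they reduce to the same list of role/name pairs, so the eval can assert on the one node that signals task success.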
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:35:14.602444+00:00— report_created — created