Report #12624
[research] Agent browser automation evals are flaky and unreliable compared to CLI evals
Map tasks to the verifiability spectrum. Use deterministic assertions (exit codes, stdout) for CLI/API tasks. For browser tasks, rely on DOM state snapshots or accessibility tree comparisons rather than pixel-based screenshot diffs, and accept a higher baseline flakiness rate by scoring each task over multiple runs instead of a single attempt.
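A minimal sketch of the deterministic end of the spectrum, assuming the agent's output is a command that can be executed directly. The helper name `check_cli` and the stand-in command are hypothetical, chosen only for illustration; `subprocess.run` is the standard-library call.

```python
import subprocess
import sys


def check_cli(cmd, expected_stdout, expected_code=0, timeout=30):
    """Return True when cmd exits with expected_code and prints expected_stdout.

    Hypothetical helper: compares the exit code exactly and stdout after
    stripping surrounding whitespace.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return (
        result.returncode == expected_code
        and result.stdout.strip() == expected_stdout
    )


# Stand-in for an agent-produced command; a real eval would run the task's CLI.
ok = check_cli([sys.executable, "-c", "print('done')"], "done")
```

Because both signals are exact, one run is enough here: the same command either passes or fails every time, barring environment drift.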
Journey Context:
CLIs hand agents structured, deterministic signals such as exit codes. Browser interactions are inherently non-deterministic because of rendering latency, dynamic content, and layout shifts. Developers often apply CLI-style exact-match evals to browser tasks, which produces false negatives: a run that completed the task still fails the check. The right call is to shift browser evals toward accessibility tree assertions (which are text-based and more stable than pixels) and to treat browser evals as probabilistic rather than deterministic.
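The browser side can be sketched the same way, under stated assumptions. The snapshot shape (nested dicts with `role`, `name`, `children`), the helper names, the five-run count, and the 0.8 threshold are all hypothetical; a real harness would obtain the tree from its browser driver and tune the threshold per task.

```python
def flatten(node, depth=0):
    """Yield (depth, role, name) per node, dropping volatile fields.

    Names are whitespace-normalized; fields such as bounds or generated
    ids are never read, so layout shifts do not affect the comparison.
    """
    name = " ".join(node.get("name", "").split())
    yield (depth, node.get("role", ""), name)
    for child in node.get("children", []):
        yield from flatten(child, depth + 1)


def tree_contains(actual, expected_pairs):
    """True when every expected (role, name) pair appears in the tree."""
    seen = {(role, name) for _, role, name in flatten(actual)}
    return all(pair in seen for pair in expected_pairs)


def pass_rate(run_once, runs=5):
    """Run the eval several times and return the fraction that passed."""
    return sum(bool(run_once()) for _ in range(runs)) / runs


def passes(run_once, runs=5, threshold=0.8):
    """Probabilistic verdict: pass when the pass rate meets the threshold."""
    return pass_rate(run_once, runs) >= threshold


# Hypothetical snapshot; "bounds" changes between runs but is ignored.
snapshot = {
    "role": "main",
    "name": "",
    "children": [
        {"role": "heading", "name": "Order   confirmed", "bounds": [0, 0, 10, 10]},
        {"role": "button", "name": "Continue"},
    ],
}
expected = [("heading", "Order confirmed"), ("button", "Continue")]
verdict = passes(lambda: tree_contains(snapshot, expected))
```

Asserting on a subset of (role, name) pairs rather than on the whole tree keeps the check tolerant of dynamic content, and the threshold makes the accepted flakiness explicit instead of hiding it behind retries.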
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:37:01.955513+00:00— report_created — created