Report #91098
[research] Agent browser automation evals are flaky and unreliable compared to CLI evals
Map agent actions to the verifiability spectrum. Use strict exit code and stdout parsing for CLI actions, but for browser actions, rely on accessibility tree snapshots or specific DOM element state checks rather than visual pixel comparisons or broad URL checks.
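The strict end of the spectrum can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper name `check_cli_action`, the `CheckResult` type, and the assumption that the CLI prints a JSON object on stdout are all hypothetical and should be adapted to the tool under test.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str


def check_cli_action(argv, expected_key, expected_value, timeout=30):
    """Strict CLI check (hypothetical helper).

    Passes only if the process exits with code 0 AND stdout parses as a
    JSON object containing expected_key == expected_value.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        return CheckResult(False, f"exit code {proc.returncode}")
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return CheckResult(False, "stdout is not valid JSON")
    if not isinstance(payload, dict):
        return CheckResult(False, "stdout JSON is not an object")
    if payload.get(expected_key) != expected_value:
        return CheckResult(False, f"{expected_key!r} was {payload.get(expected_key)!r}")
    return CheckResult(True, "ok")


# Usage: the Python interpreter stands in for the CLI being evaluated.
demo = [sys.executable, "-c", "print('{\"status\": \"ok\"}')"]
print(check_cli_action(demo, "status", "ok"))
```

Checking the exit code before parsing stdout keeps the failure reason specific, which makes flaky runs easier to triage than a single pass/fail flag.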
Journey Context:
CLI tools are largely deterministic and produce structured output; browsers do not. Treating browser evals like CLI evals (checking only the final URL or page text) misses UI state failures, such as a control that rendered but is disabled or unchecked. Visual evals tend to be flaky because rendering differs across environments. The accessibility tree provides a structured, verifiable intermediate representation that bridges the gap between unstructured UI and structured CLI outputs.
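A browser-side check against an accessibility tree snapshot might look like the sketch below. It assumes the snapshot has already been captured and converted to nested dictionaries with `role`, `name`, and `children` keys; that shape, the state keys (`checked`, `disabled`), and the helper names are assumptions for illustration, since real automation tools expose snapshots in differing formats.

```python
from typing import Any, Dict, Iterator, Optional

Node = Dict[str, Any]


def iter_nodes(node: Node) -> Iterator[Node]:
    """Depth-first walk over an accessibility snapshot (assumed dict shape)."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_node(root: Node, role: str, name: str) -> Optional[Node]:
    """Return the first node matching both role and accessible name."""
    for node in iter_nodes(root):
        if node.get("role") == role and node.get("name") == name:
            return node
    return None


def check_element_state(root: Node, role: str, name: str, expected: Dict[str, Any]) -> bool:
    """Pass only if the element exists and every expected state key matches."""
    node = find_node(root, role, name)
    if node is None:
        return False
    return all(node.get(key) == value for key, value in expected.items())


# Hypothetical snapshot of a checkout page after the agent has acted.
snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "checkbox", "name": "Accept terms", "checked": True},
        {"role": "button", "name": "Place order", "disabled": False},
    ],
}

print(check_element_state(snapshot, "checkbox", "Accept terms", {"checked": True}))
```

A URL or text check would pass on this page whether or not the checkbox was ticked; the state check above fails if the element is missing or in the wrong state.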
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:30:06.473306+00:00— report_created — created