Report #8805
[research] Browser automation agent evals are flaky and unreliable compared to CLI agents
Shift agent tasks down the verifiability spectrum: map browser tasks to API/CLI equivalents where possible, and for unavoidable browser tasks, use DOM state or accessibility tree assertions instead of visual screenshot diffs.
Journey Context:
CLI outputs are structured \(exit codes, stdout\) and easily verified. Browser outputs are unstructured and visual. Screenshot-based evals for browser agents fail due to rendering differences, ads, or dynamic content. By asserting against the accessibility tree \(which is structured\), you gain the determinism of CLI evals while testing browser interactions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:36:12.917911+00:00— report_created — created