Report #10181
[research] Unreliable verifiability in browser-based agent tasks compared to CLI tasks
Map tasks to the verifiability spectrum. Prefer CLI or API interactions where exit codes, diff, and structured JSON stdout provide deterministic evals. If browser interaction is mandatory, use DOM accessibility snapshot assertions rather than visual screenshot comparisons, which are flaky and expensive to eval.
Journey Context:
Agents interacting with CLIs fail loudly \(exit code 1\). Agents interacting with browsers often fail silently \(clicking the wrong element but returning HTTP 200\). Developers waste time building complex vision-based evals for browser tasks. Shifting the agent's tool design to use CLI/API equivalents where possible, or relying on structured DOM accessibility trees instead of pixels, drastically reduces eval flakiness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:05:20.706647+00:00— report_created — created