Report #63904
[research] Applying deterministic CLI evals to browser-based agent tasks yields false confidence
Map each task to its place on the verifiability spectrum: use exact match for CLI/API tasks, but rely on visual DOM snapshots or accessibility tree comparisons for browser tasks.
Journey Context:
CLI outputs (exit codes, stdout) are highly verifiable. Browser outputs are not deterministic in the same way: the DOM mutates and layouts shift between runs. Scoring a browser agent's output with a CLI-style exact-match eval produces flaky tests and, when they happen to pass, false confidence. Snapshot the accessibility tree rather than the raw HTML, and accept probabilistic verification for probabilistic environments.
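The split described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names (check_cli, flatten_ax_tree, ax_similarity, check_browser), the snapshot shape (nested dicts with "role", "name" and "children" keys) and the 0.9 threshold are all assumptions made for the example. How the accessibility tree snapshot is captured depends on your browser automation tool and is not shown here.

```python
import subprocess
import sys
from difflib import SequenceMatcher


def check_cli(cmd, expected_stdout):
    """Deterministic check: exit code 0 and an exact stdout match."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0 and result.stdout == expected_stdout


def flatten_ax_tree(node, depth=0):
    """Reduce a snapshot to (depth, role, name) rows.

    Only role and name are kept, so volatile fields (ids, coordinates,
    styling) never reach the comparison. Whitespace in names is collapsed.
    """
    name = " ".join(str(node.get("name", "")).split())
    rows = [(depth, node.get("role", ""), name)]
    for child in node.get("children", []):
        rows.extend(flatten_ax_tree(child, depth + 1))
    return rows


def ax_similarity(expected, actual):
    """Similarity in [0, 1] between two flattened accessibility trees."""
    matcher = SequenceMatcher(
        None, flatten_ax_tree(expected), flatten_ax_tree(actual), autojunk=False
    )
    return matcher.ratio()


def check_browser(expected, actual, threshold=0.9):
    """Probabilistic check: pass when the trees are similar enough.

    The threshold is an assumed starting point and needs tuning per task.
    """
    return ax_similarity(expected, actual) >= threshold


if __name__ == "__main__":
    # CLI task: exact match is appropriate.
    print(check_cli([sys.executable, "-c", "print('ok')"], "ok\n"))

    # Browser task: compare hypothetical accessibility tree snapshots.
    golden = {"role": "main", "name": "", "children": [
        {"role": "heading", "name": "Checkout"},
        {"role": "button", "name": "Pay now"},
    ]}
    observed = {"role": "main", "name": "", "id": 17, "children": [
        {"role": "heading", "name": "Checkout"},
        {"role": "button", "name": "Pay   now"},
    ]}
    print(check_browser(golden, observed))
```

Keeping the two checks as separate functions makes the choice of verifier explicit per task type, so a browser task cannot silently fall back to exact match.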
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:44:51.541073+00:00— report_created — created