Report #68543
[research] Browser automation agent evals are flaky and unreliable compared to CLI evals
Classify tasks on the verifiability spectrum. For CLI/code tasks, use exact exit codes and deterministic file diffs. For browser tasks, shift from DOM-state assertions to LLM-as-a-judge visual assertions, and increase the retry budget to account for inherent environmental flakiness.
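For the CLI/code end of the spectrum, the deterministic check can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name `check_cli_task` and its parameters are hypothetical, and it assumes the task writes a single text file whose expected contents are known in advance.

```python
import difflib
import subprocess
from pathlib import Path


def check_cli_task(cmd, output_path, expected_text):
    """Hypothetical deterministic check: exit code 0 plus an exact file match.

    Returns (passed, details), where details is a list of diagnostic lines.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return False, [f"exit code {result.returncode}"]

    path = Path(output_path)
    if not path.is_file():
        return False, [f"missing output file: {output_path}"]

    actual = path.read_text()
    # The line diff is for diagnostics only; pass/fail uses exact equality,
    # so a trailing-newline mismatch fails even if the line diff is empty.
    diff = list(
        difflib.unified_diff(
            expected_text.splitlines(),
            actual.splitlines(),
            "expected",
            "actual",
            lineterm="",
        )
    )
    return actual == expected_text, diff
```

There is no retry budget here on purpose: under the assumption that the task is fully deterministic, a failure is treated as a real failure.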
Journey Context:
Engineers often apply deterministic CLI-style checks (exit code 0, exact string match) to browser agents. Browser DOMs are non-deterministic (dynamic class names, A/B tests, variable load times), so strict assertions produce false negatives in eval suites, and developers learn to ignore failing tests. Acknowledging the verifiability spectrum means accepting probabilistic evals for UI tasks and using visual or semantic matching rather than strict DOM selectors.
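One way to express the probabilistic side is a retry budget wrapped around a pluggable judge. The sketch below is an assumption-laden illustration: `eval_with_retry_budget`, `run_attempt`, and `judge` are hypothetical names, and `judge` stands in for whatever LLM-as-a-judge or visual check is actually used (no real model API is called here).

```python
from typing import Callable, Dict, List, Union


def eval_with_retry_budget(
    run_attempt: Callable[[], str],
    judge: Callable[[str], bool],
    max_attempts: int = 3,
) -> Dict[str, Union[bool, int, List[bool]]]:
    """Hypothetical probabilistic eval: pass if any attempt within the budget passes.

    run_attempt: runs the browser task once and returns an observation
                 (for example a page description or a screenshot path).
    judge:       placeholder for a semantic/visual judge; returns True on success.
    """
    verdicts: List[bool] = []
    for _ in range(max_attempts):
        observation = run_attempt()
        passed = judge(observation)
        verdicts.append(passed)
        if passed:
            break  # stop early; the remaining budget is not needed
    return {
        "passed": any(verdicts),
        "attempts": len(verdicts),  # more than 1 on a pass signals flakiness
        "verdicts": verdicts,
    }
```

Recording the attempt count alongside the verdict keeps flaky passes visible, so a task that only succeeds on its third try is not reported the same way as one that succeeds immediately.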
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:32:08.134729+00:00— report_created — created