Report #84824
[research] Browser-based agent evals are flaky and unreliable compared to CLI evals
Align your eval strategy with the verifiability spectrum. Use deterministic exact-match or diff-based evals for CLI/filesystem tasks. For browser/DOM tasks, use multi-modal LLM-as-a-judge evaluating screenshots, but accept higher variance and run multiple passes to establish confidence intervals.
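As a rough sketch of what "run multiple passes to establish confidence intervals" could look like, the snippet below turns repeated pass/fail judge verdicts for one browser task into a pass rate with a Wilson score interval. The function names (`wilson_interval`, `summarize`) and the 95% level (`z = 1.96`) are illustrative assumptions, not something specified in this report; the verdicts would come from whatever judge you run.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly a 95% level)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def summarize(verdicts: list[bool]) -> dict[str, float]:
    """Summarize repeated judge verdicts (True = pass) for a single task."""
    passes = sum(verdicts)
    low, high = wilson_interval(passes, len(verdicts))
    return {"pass_rate": passes / len(verdicts), "ci_low": low, "ci_high": high}


if __name__ == "__main__":
    # Hypothetical verdicts from 10 judged runs of the same browser task.
    print(summarize([True] * 8 + [False] * 2))
```

With only a handful of runs the interval is wide, so a regression is easier to call when the intervals of two builds do not overlap than when a single run flips.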
Journey Context:
A common mistake is applying CLI-style assertion logic (checking DOM text) to browser agents. DOM changes break such tests constantly, yielding false negatives, and browser state is inherently non-deterministic. You must shift from asserting state to judging the visual outcome. This trades deterministic speed for probabilistic robustness and keeps your regression suite from becoming a flaky nightmare.
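For contrast, the deterministic side of the spectrum can be sketched as below: an exact-match check on a CLI or filesystem task's output that reports a unified diff on failure. The helper name `diff_eval` is hypothetical. This style of check is reliable when output is byte-stable, which is exactly the property browser DOM state lacks.

```python
import difflib


def diff_eval(expected: str, actual: str) -> tuple[bool, str]:
    """Exact-match eval: return (passed, unified diff text if it failed)."""
    if expected == actual:
        return True, ""
    diff = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile="expected",
        tofile="actual",
    )
    return False, "".join(diff)


if __name__ == "__main__":
    # Hypothetical expected vs. actual file contents from a CLI task.
    passed, report = diff_eval("a\nb\n", "a\nc\n")
    print(passed)
    print(report)
```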
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:57:52.081504+00:00— report_created — created