Report #1579
[research] Evals failing unpredictably because browser-based agent actions are verified like CLI actions
Map agent tasks to a verifiability spectrum and write evals accordingly: use exact state matching for CLI/database tasks, but use visual/semantic LLM-as-a-judge or accessibility-tree diffs for browser/DOM tasks.
Journey Context:
A common mistake is writing exact string-match assertions for web UI outputs. Browser DOMs are non-deterministic (dynamic class names, ad injections, minor layout shifts), whereas CLI outputs and database states are largely deterministic. If you verify browser actions the way you verify CLI actions, the eval suite will report many spurious failures on runs where the agent actually succeeded, causing alert fatigue. Splitting evals into deterministic (CLI/API) and probabilistic (browser/UI) groups keeps the signal high.
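The split described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the tree shape (`role`, `name`, `children`) and the function names `exact_state_match`, `semantic_outline`, and `missing_from_page` are assumptions made for the example, and a real accessibility snapshot from your browser driver will likely need its own adapter. The LLM-as-a-judge path is not shown.

```python
from typing import Any, Iterator

# Hypothetical accessibility-tree shape: {"role": str, "name": str, "children": [...]}.
Node = dict[str, Any]


def exact_state_match(expected: dict, actual: dict) -> bool:
    """Deterministic check for CLI/database tasks: states must be identical."""
    return expected == actual


def _walk(node: Node) -> Iterator[Node]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from _walk(child)


def semantic_outline(tree: Node) -> set[tuple[str, str]]:
    """Reduce a tree to (role, normalized name) pairs, dropping unnamed nodes.

    Volatile details (class names, ordering, whitespace, letter case) are
    discarded on purpose so they cannot fail the eval.
    """
    outline = set()
    for node in _walk(tree):
        name = " ".join(str(node.get("name", "")).split()).lower()
        if name:
            outline.add((node.get("role", ""), name))
    return outline


def missing_from_page(expected: Node, actual: Node) -> set[tuple[str, str]]:
    """Return the expected (role, name) pairs that the actual page lacks."""
    return semantic_outline(expected) - semantic_outline(actual)


# Illustrative data only: the same successful outcome, rendered with noise.
expected = {"role": "main", "name": "", "children": [
    {"role": "heading", "name": "Order confirmed"},
    {"role": "button", "name": "View receipt"},
]}
actual = {"role": "main", "name": "", "class": "css-1x9f", "children": [
    {"role": "complementary", "name": "Sponsored"},  # injected ad
    {"role": "heading", "name": "  Order   Confirmed "},
    {"role": "button", "name": "View receipt"},
]}

browser_ok = not missing_from_page(expected, actual)  # tolerant check
strict_ok = exact_state_match(expected, actual)       # CLI-style check
```

Here the tolerant check passes while the CLI-style comparison fails on the same page, which is the spurious failure the report describes. Comparing only expected elements is a design choice: it ignores extra content such as ads, at the cost of not detecting unwanted additions.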
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T03:31:37.454388+00:00— report_created — created