Report #8430
[research] Agent evals flake wildly on browser/DOM interactions but pass reliably on CLI tasks
Classify each task on the verifiability spectrum and design its eval accordingly. For CLI/API tasks, use exact exit-code matching. For browser tasks, use a state diff or an LLM-as-a-judge with grounded visual models. Never use exact string matching for UI.
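For the CLI end of the spectrum, a minimal sketch of an exit-code eval might look like the following. The function name `cli_eval_passes` and the timeout default are hypothetical, chosen only for illustration; the only library used is Python's standard `subprocess` module.

```python
from __future__ import annotations

import subprocess
import sys


def cli_eval_passes(cmd: list[str], expected_exit_code: int = 0,
                    timeout_s: float = 30.0) -> bool:
    """Deterministic check: pass only if the process exits with the expected code.

    Hypothetical helper for illustration; a timeout counts as a failure.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == expected_exit_code


# Usage: the "task" here is a stand-in command that simply exits with a given code.
passed = cli_eval_passes([sys.executable, "-c", "import sys; sys.exit(0)"])
failed = cli_eval_passes([sys.executable, "-c", "import sys; sys.exit(1)"])
```

The check compares a single integer, so it does not depend on output formatting. That is the property exact string matching lacks on UI tasks.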
Journey Context:
Developers often apply CLI-style exact-match evals to browser automation. Browser DOMs change between runs (generated class names, dynamic IDs), so exact matching produces a high false-negative rate. Recognizing the verifiability spectrum means accepting that browser tasks are inherently probabilistic. You must shift from deterministic assertions on markup to state-based assertions (e.g., "does the cart contain item X?" rather than "does the DOM have this exact tree?").
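The difference between the two assertion styles can be sketched as below. The state dictionaries, the `sku`/`qty` keys, and the helpers `state_diff` and `cart_contains` are all assumptions made for illustration. How application state is actually read (an API call, local storage, a database query) depends on the app under test and is not shown.

```python
from typing import Any, Dict, Tuple


def state_diff(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return top-level keys whose values changed, mapped to (before, after)."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}


def cart_contains(state: Dict[str, Any], sku: str, min_qty: int = 1) -> bool:
    """State-based assertion: is item `sku` in the cart with at least `min_qty`?"""
    return any(item.get("sku") == sku and item.get("qty", 0) >= min_qty
               for item in state.get("cart", []))


# Two DOM snapshots of the same logical outcome; only generated IDs/classes differ.
dom_run_1 = '<li id="item-8f3a" class="css-1x2y3z">Blue Mug</li>'
dom_run_2 = '<li id="item-c71d" class="css-9q8w7e">Blue Mug</li>'

# Hypothetical application state captured before and after the agent acts.
state_before = {"cart": [], "user": "demo"}
state_after = {"cart": [{"sku": "MUG-BLUE", "qty": 1}], "user": "demo"}

exact_match_passes = dom_run_1 == dom_run_2                   # brittle markup comparison
state_check_passes = cart_contains(state_after, "MUG-BLUE")   # checks the outcome
changed = state_diff(state_before, state_after)               # only "cart" changed
```

Here the exact-match comparison fails even though the agent succeeded, while the state check passes. The diff also confirms that nothing other than the cart changed.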
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:34:49.493608+00:00 - report_created - created