Report #59068
[research] Browser-based agent actions are unreliable to verify, leading to flaky eval suites
Classify agent actions on a verifiability spectrum and match assertion strictness to the tier. CLI/shell commands (exit code + stdout) are highly verifiable: use exact assertions. API calls (status code + structured JSON) are moderately verifiable: use schema + spot-check assertions. Browser/DOM interactions are low verifiability: use probabilistic assertions (screenshot diff with tolerance, LLM-as-judge with explicit rubric, or task-completion checks rather than DOM state checks). Never apply exact-match assertions to browser actions.
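A minimal sketch of the three tiers, one assertion helper per tier. The function names, the expected status of 200, the flat pixel-list representation of a screenshot, and the 2% default tolerance are all illustrative assumptions, not part of any particular eval framework.

```python
from typing import Any, Mapping, Sequence


def assert_cli(exit_code: int, stdout: str, expected_stdout: str) -> bool:
    """High verifiability: exact match on exit code and stdout."""
    return exit_code == 0 and stdout.strip() == expected_stdout.strip()


def assert_api(status: int, body: Mapping[str, Any],
               schema: Mapping[str, type],
               spot_checks: Mapping[str, Any]) -> bool:
    """Moderate verifiability: status, field types, and a few exact values.

    Assumes 200 is the expected status; extra fields in `body` are ignored.
    """
    if status != 200:
        return False
    if any(not isinstance(body.get(k), t) for k, t in schema.items()):
        return False
    return all(body.get(k) == v for k, v in spot_checks.items())


def assert_screenshot(actual: Sequence[int], baseline: Sequence[int],
                      tolerance: float = 0.02) -> bool:
    """Low verifiability: pass if at most `tolerance` of pixels differ.

    Screenshots are modeled as flat pixel sequences (an assumption made
    to keep the sketch dependency-free).
    """
    if not baseline or len(actual) != len(baseline):
        return False
    changed = sum(1 for a, b in zip(actual, baseline) if a != b)
    return changed / len(baseline) <= tolerance
```

The strictness drops from tier to tier: the CLI check rejects any stdout difference, the API check ignores fields it was not asked about, and the screenshot check accepts a small fraction of differing pixels.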
Journey Context:
A common mistake is applying the same eval rigor to all action types. CLI commands give you deterministic exit codes and structured output, so you can and should assert exactly. Browser interactions are inherently non-deterministic: rendering timing, dynamic DOM IDs, animation states, and anti-bot measures all introduce variance. Trying to use strict DOM assertions on browser actions leads to flaky evals that erode developer trust in the entire eval suite, and people start ignoring failures. The right call is to lower assertion strictness to match verifiability. For browser actions, check task completion (did the booking go through?) not visual state (is the button exactly here?). WebArena's own eval uses functional correctness, not DOM matching.
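A task-completion check for a browser action could be sketched as polling the system of record instead of inspecting the DOM. Here `fetch_state` and `is_done` are hypothetical callables supplied by the eval author (for example, a lookup against a booking backend); the timeout and polling interval are arbitrary defaults.

```python
import time
from typing import Callable, Optional


def wait_for_completion(fetch_state: Callable[[], Optional[dict]],
                        is_done: Callable[[dict], bool],
                        timeout_s: float = 5.0,
                        poll_s: float = 0.1) -> bool:
    """Return True once the task's outcome is observable, False on timeout.

    Polling absorbs timing variance without asserting on any DOM detail.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state is not None and is_done(state):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage with a fake backend that confirms the booking on the third poll.
_responses = iter([None, {"status": "pending"}, {"status": "confirmed"}])
booked = wait_for_completion(
    fetch_state=lambda: next(_responses),
    is_done=lambda s: s["status"] == "confirmed",
    timeout_s=2.0,
    poll_s=0.01,
)
```

The assertion is on the outcome (the booking exists and is confirmed), so changes in layout, element IDs, or animation timing do not affect the result.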
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:38:11.329283+00:00— report_created — created