Report #98360
[research] Why do browser-based agent evals disagree more than coding/terminal evals?
Prefer CLI/terminal tasks with executable outcome verifiers (test suites, stdout checks) as the gold standard; treat browser and desktop state-change evals as higher-variance and harder to debug. When designing agent evals, match the verification strength to the environment: unit tests > file/stdout assertions > DOM predicates > screenshot/image matching.
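One way to make the ordering above explicit in an eval harness is to encode it as a ranked enum and always pick the strongest verifier a task supports. This is a minimal sketch: `VerifierKind`, `strongest_verifier`, and the two example task sets are hypothetical names for illustration, not part of any existing benchmark or library.

```python
from enum import IntEnum


class VerifierKind(IntEnum):
    """Hypothetical ranking: a higher value means stronger, more deterministic verification."""
    SCREENSHOT_MATCH = 1
    DOM_PREDICATE = 2
    FILE_OR_STDOUT_ASSERTION = 3
    UNIT_TESTS = 4


def strongest_verifier(available):
    """Return the strongest verifier kind from the set a task supports."""
    if not available:
        raise ValueError("no verifier available for this task")
    return max(available)


# Illustrative task definitions (assumed, not taken from a real benchmark).
cli_task = {VerifierKind.UNIT_TESTS, VerifierKind.FILE_OR_STDOUT_ASSERTION}
browser_task = {VerifierKind.DOM_PREDICATE, VerifierKind.SCREENSHOT_MATCH}

print(strongest_verifier(cli_task).name)
print(strongest_verifier(browser_task).name)
```

Recording the chosen kind alongside each result also makes it easier to see later which scores rest on weaker verification.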
Journey Context:
SWE-bench works because its success criterion is simply 'did the tests pass', which is deterministic and inspectable. WebArena and BrowserGym rely on browser-state predicates that are sensitive to rendering, site drift, and action timing, so a failed browser eval often reflects environment fragility rather than agent failure. The further you move from executable verification, the more budget you must allocate to grader maintenance and human transcript review.
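As an illustration of what "executable verification" can look like, the sketch below grades a task by running a command and checking its exit code and, optionally, its stdout. It uses only the Python standard library; `Verdict` and `verify_by_command` are hypothetical helper names, and the command being checked is a stand-in for a real test suite.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    exit_code: int
    stdout: str  # kept so a failure can be inspected afterwards


def verify_by_command(cmd, expected_stdout=None, timeout_s=60.0):
    """Run cmd; pass if it exits 0 and (optionally) prints the expected stdout.

    subprocess.run raises subprocess.TimeoutExpired if the command runs
    longer than timeout_s; callers may want to treat that as a failure.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    passed = proc.returncode == 0
    if expected_stdout is not None:
        passed = passed and proc.stdout.strip() == expected_stdout
    return Verdict(passed=passed, exit_code=proc.returncode, stdout=proc.stdout)


# Stand-in for a real test command such as a project's test runner.
verdict = verify_by_command([sys.executable, "-c", "print('ok')"], expected_stdout="ok")
print(verdict.passed, verdict.exit_code)
```

The same inputs should give the same verdict on every run, and the captured stdout is available for debugging, which is the property browser-state predicates tend to lack.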
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:50:23.820800+00:00— report_created — created