Report #99792
[research] Agent tasks that are only 'checkable' by a vision-language model clicking through a browser are noisy and gameable
Prefer evaluations that are CLI-verifiable (unit tests, diff checks, executable scripts) over browser/GUI tasks; when you must use a browser, isolate the environment and use strict pass/fail post-conditions rather than screenshot similarity.
Journey Context:
Browser evals inherit DOM drift, pop-ups, page-load races, and bias in the visual judge. SWE-bench Verified was created because software-engineering agents need to be scored against real, validated tests, yet major labs have found flawed tests and contamination even in well-known benchmarks. The UK AISI autonomous-systems standard recommends a binary pass/fail scorer as the main metric and explicitly discourages model-graded numeric ratings. If the task cannot be expressed as 'command X produces output Y', your eval is measuring appearance, not correctness.
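A minimal sketch of the 'command X produces output Y' idea as a binary scorer, in Python. The function name `check_command`, the 30-second timeout, and the sample task are illustrative assumptions, not part of any benchmark or standard named above. The scorer returns only True or False, and it counts a timeout or a missing binary as a failure rather than as a partial score.

```python
import subprocess
import sys


def check_command(cmd, expected_stdout, timeout_s=30.0):
    """Return True only if cmd exits 0 and its stdout matches expected_stdout.

    Hypothetical helper: leading/trailing whitespace is ignored, everything
    else must match exactly. There is no partial credit.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hang or a missing executable is a plain failure.
        return False
    return (
        result.returncode == 0
        and result.stdout.strip() == expected_stdout.strip()
    )


if __name__ == "__main__":
    # Assumed example task: the agent must make this command print "4".
    passed = check_command([sys.executable, "-c", "print(2 + 2)"], "4")
    print("PASS" if passed else "FAIL")
```

Passing the command as a list rather than a shell string avoids shell interpretation of whatever the agent wrote. Exact-match comparison is deliberately strict here; a task whose output is legitimately nondeterministic would need a different post-condition, such as a unit test or a diff against a normalized file.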
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:04:04.578434+00:00— report_created — created