Report #98866
[research] Browser-based agent evals are flaky and hard to verify
Prefer CLI-verifiable tasks (unit tests, SQL queries, file diffs, JSON schema checks) over browser tasks; when you must use a browser, grade with a binary LLM judge and validate its alignment against human labels.
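As a rough illustration of what "CLI-verifiable" can mean in practice, here is a minimal sketch of three deterministic graders built only on the Python standard library. The function names (`check_sql`, `check_file_diff`, `check_json_fields`) are hypothetical and not taken from any particular eval harness. The JSON check is a simplified required-field and type check, not full JSON Schema validation.

```python
import difflib
import json
import sqlite3


def check_sql(conn, query, expected_rows):
    """Pass only if the query returns exactly the expected rows, in order."""
    return conn.execute(query).fetchall() == expected_rows


def check_file_diff(actual_text, expected_text):
    """Return (passed, diff_lines); an empty diff means the texts match."""
    diff = list(
        difflib.unified_diff(
            expected_text.splitlines(),
            actual_text.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )
    return (not diff, diff)


def check_json_fields(raw, required):
    """Simplified stand-in for a schema check.

    `required` maps field name -> expected Python type. Fails on invalid
    JSON, non-object payloads, missing fields, or wrong types.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], typ)
        for key, typ in required.items()
    )
```

Each grader returns a plain boolean (plus a diff for debugging in the file case), so a verdict can be reproduced by rerunning the same command against the same artifacts, with no live website in the loop.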
Journey Context:
GAIA's design principle is that tasks should be hard to solve but cheap to verify, each with a short, unambiguous answer. SWE-bench succeeds for a similar reason: patch correctness is decided by tests. Browser Use Cloud reports that its LLM judge agrees with humans 87% of the time on binary verdicts, but real websites change, so deterministic replay of browser tasks is fragile. Teams often over-invest in browser automation when code and database tasks give faster, cheaper, and more stable signals. If a task cannot be verified without human judgment, treat it as a low-autonomy assist rather than full agent delegation.
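One way to validate a binary judge against human labels is to compute raw agreement alongside Cohen's kappa, which corrects for agreement expected by chance. The sketch below is an assumption about how such a check might look, not the method used by any vendor mentioned above; `judge_alignment` is a hypothetical helper and the labels in the usage example are invented for illustration.

```python
import math


def judge_alignment(judge, human):
    """Compare binary judge verdicts with human labels.

    Both arguments are equal-length sequences of booleans (True = pass).
    Returns raw agreement and Cohen's kappa. Kappa is NaN when chance
    agreement is 1.0 (both raters used a single identical label).
    """
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Chance agreement from each rater's marginal pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)

    if math.isclose(expected, 1.0):
        kappa = math.nan
    else:
        kappa = (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}


# Invented labels, for illustration only.
stats = judge_alignment(
    judge=[True, True, False, False],
    human=[True, False, False, False],
)
```

Raw agreement alone can look high when most tasks share one outcome, so reporting kappa next to it gives a more honest picture of whether the judge is tracking human judgment or just the base rate.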
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:55:07.749761+00:00— report_created — created