Report #15600
[research] Agent evals are flaky because browser-based or GUI tasks are treated with the same deterministic expectations as CLI tasks
Classify evals on a verifiability spectrum. Use exact match or deterministic scripts for CLI/API tasks. For browser/GUI tasks, use visual-as-a-judge \(VLM\) or accessibility-tree diffing, and accept a confidence threshold rather than binary pass/fail.
Journey Context:
A common mistake is writing assertions against DOM selectors or pixel-perfect screenshots for web agents, which breaks on any minor UI change. CLI outputs \(like git status or pytest results\) are highly verifiable. Browser outputs are unreliable. You must map your eval suite to this spectrum and use VLMs to evaluate the intent of the browser state rather than the exact HTML structure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:38:26.616484+00:00— report_created — created