Report #5303
[research] Agent evals are flaky when verifying browser or GUI interactions
Shift eval weight to CLI/API verifiable steps. For browser tasks, evaluate the DOM state or accessibility tree rather than pixel screenshots, and isolate non-deterministic browser steps into sandboxed mocks for regression suites.
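One way to picture "shifting eval weight" is a score that counts each step according to how deterministic its verifier is. The sketch below is illustrative only: the verifier labels, the `WEIGHTS` values, and the `StepResult` type are all hypothetical and would need tuning for a real suite.

```python
from dataclasses import dataclass

# Hypothetical weights: more deterministic verifiers count for more.
# Screenshot checks are recorded but contribute nothing to the score.
WEIGHTS = {"cli": 1.0, "api": 1.0, "a11y_tree": 0.5, "screenshot": 0.0}


@dataclass(frozen=True)
class StepResult:
    name: str       # human-readable step description
    verifier: str   # key into WEIGHTS
    passed: bool


def weighted_score(results: list[StepResult]) -> float:
    """Return the weighted pass rate over all steps."""
    total = sum(WEIGHTS[r.verifier] for r in results)
    if total == 0:
        raise ValueError("no verifiable steps to score")
    earned = sum(WEIGHTS[r.verifier] for r in results if r.passed)
    return earned / total


# Made-up results for illustration.
results = [
    StepResult("create order via API", "api", True),
    StepResult("confirm row via CLI", "cli", True),
    StepResult("confirmation heading present", "a11y_tree", False),
    StepResult("page looks right", "screenshot", False),
]
score = weighted_score(results)
```

With weights like these, a flaky screenshot check cannot flip the outcome of a regression run, while a failed API or CLI check still dominates the score.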
Journey Context:
Evaluating browser agents via VLM judgments or screenshot comparison is noisy and non-deterministic. On the verifiability spectrum, CLI and structured API outputs sit at one end (highly verifiable, deterministic) and GUI/browser outputs at the other (low verifiability, flaky). For reliable regression evals, have the agent use structured APIs and CLIs rather than UI scraping wherever possible, and when a UI is unavoidable, evaluate the accessibility tree (structured text) instead of pixels.
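As a minimal sketch of checking structured UI state instead of pixels, the example below asserts on an accessibility-tree snapshot. It assumes the snapshot has already been captured and is a nested dict with `role`, `name`, and `children` keys; that shape, the sample data, and the helper names are assumptions, not the output format of any particular tool.

```python
from typing import Any, Iterator

Node = dict[str, Any]


def iter_nodes(tree: Node) -> Iterator[Node]:
    """Yield every node in the tree, depth-first."""
    yield tree
    for child in tree.get("children", []):
        yield from iter_nodes(child)


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so cosmetic changes don't fail the check."""
    return " ".join(text.split()).casefold()


def has_node(tree: Node, role: str, name: str) -> bool:
    """True if any node has the given role and a matching normalized name."""
    want = normalize(name)
    return any(
        n.get("role") == role and normalize(str(n.get("name", ""))) == want
        for n in iter_nodes(tree)
    )


# Hypothetical snapshot captured after the agent submits a checkout form.
snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order  confirmed"},
        {"role": "button", "name": "Continue shopping"},
    ],
}

order_confirmed = has_node(snapshot, "heading", "Order confirmed")
```

Because the check compares roles and normalized names, it is unaffected by fonts, anti-aliasing, or layout shifts that would break a pixel diff.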
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:02:54.643638+00:00 — report_created — created