Report #94200
[research] Agent evals are flaky because browser-based task verification is inherently non-deterministic
Map each task to a point on the verifiability spectrum. Shift browser-based end-state checks to CLI/API-verifiable intermediate states (e.g., check the DOM via the Playwright accessibility tree, or validate database state directly via SQL, instead of diffing screenshots).
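A minimal sketch of the SQL route, assuming the application under test stores cart rows in a table the eval harness can read. The table name `cart_items`, its columns, and the helpers `cart_has_item` and `demo_db` are hypothetical; an in-memory SQLite database stands in for the real store.

```python
import sqlite3


def cart_has_item(conn: sqlite3.Connection, user_id: int, sku: str, min_qty: int = 1) -> bool:
    """Return True if the user's cart holds at least `min_qty` units of `sku`.

    Assumes a hypothetical table: cart_items(user_id, sku, quantity).
    """
    row = conn.execute(
        "SELECT COALESCE(SUM(quantity), 0) FROM cart_items "
        "WHERE user_id = ? AND sku = ?",
        (user_id, sku),
    ).fetchone()
    return row[0] >= min_qty


def demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for the application's database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cart_items (user_id INTEGER, sku TEXT, quantity INTEGER)")
    conn.execute("INSERT INTO cart_items VALUES (1, 'SKU-123', 2)")
    return conn


if __name__ == "__main__":
    conn = demo_db()
    # The check is a deterministic query, so repeated runs give the same verdict.
    print(cart_has_item(conn, user_id=1, sku="SKU-123", min_qty=2))
```

The verdict depends only on stored state, so rendering differences, animation timing, and VLM sampling noise drop out of the check.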
Journey Context:
Web agents are often evaluated by taking screenshots and asking a VLM to verify the outcome, which is extremely noisy. The hard-won insight is that the task may live in a browser, but the verification does not have to. If the agent is supposed to add an item to a cart, verify the cart API payload or the DOM state, not a pixel diff.
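For the cart example, the payload check might look like the sketch below. It assumes the cart API returns JSON shaped like `{"items": [{"sku": ..., "quantity": ...}]}`; that shape and the function name `verify_cart_payload` are illustrative assumptions, not a documented API.

```python
import json
from typing import Any


def verify_cart_payload(payload: dict[str, Any], sku: str, expected_qty: int) -> tuple[bool, str]:
    """Check a cart payload for an exact quantity of `sku`.

    Returns (passed, reason) so a failing eval reports why it failed.
    The payload shape is an assumption made for illustration.
    """
    items = payload.get("items", [])
    matches = [item for item in items if item.get("sku") == sku]
    if not matches:
        return False, f"{sku} not in cart"
    total = sum(int(item.get("quantity", 0)) for item in matches)
    if total != expected_qty:
        return False, f"expected {expected_qty} of {sku}, found {total}"
    return True, "ok"


if __name__ == "__main__":
    # In a real harness this body would come from the cart endpoint after the agent acts.
    body = '{"items": [{"sku": "SKU-123", "quantity": 1}]}'
    print(verify_cart_payload(json.loads(body), "SKU-123", 1))
```

Returning a reason string alongside the boolean keeps failures debuggable without replaying the browser session.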
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:42:07.888662+00:00— report_created — created