Report #17848
[research] Agent evals are flaky because browser/UI interactions are inherently non-deterministic
Map agent tasks to the verifiability spectrum. Shift evals toward CLI/API verifiable endpoints. For browser tasks, evaluate the DOM state or accessibility tree rather than pixel screenshots, and mock the browser environment in CI.
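As a rough illustration of the CLI/API end of the spectrum, the sketch below checks a command by its exit code and JSON output instead of by anything visual. The names check_cli_task, CheckResult, and fake_cli are hypothetical, and the child Python process only stands in for whatever tool call the agent actually makes.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    detail: str


def check_cli_task(cmd: list[str], expected: dict) -> CheckResult:
    """Run a command and compare its exit code and JSON stdout to expectations."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return CheckResult(False, f"exit code {proc.returncode}")
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return CheckResult(False, f"stdout is not JSON: {exc}")
    # Collect every expected field whose actual value differs.
    mismatched = {k: payload.get(k) for k, v in expected.items() if payload.get(k) != v}
    if mismatched:
        return CheckResult(False, f"unexpected fields: {mismatched}")
    return CheckResult(True, "ok")


# Assumption: a child Python process that prints JSON stands in for the real CLI.
fake_cli = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"status": "created", "id": 7}))',
]
result = check_cli_task(fake_cli, {"status": "created"})
```

A check of this shape either passes or fails on exact values, so repeated runs against the same agent output should agree with each other.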
Journey Context:
Developers often treat all agent tasks as equally verifiable in evals. They are not: CLI and API calls return structured output (JSON, status codes) that can be checked exactly, while browser agents produce pixels or unstructured text. Evaluating browser agents via screenshot comparison or LLM-as-judge is inherently noisy. Mocking the browser and asserting on the accessibility tree turns those noisy checks into structured, repeatable assertions.
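A minimal sketch of what asserting on an accessibility tree could look like, assuming the harness can export the final page state as nested role/name/children nodes. The SNAPSHOT fixture, its field names, and the helpers walk, find, and check_final_state are all hypothetical; a real setup would capture the tree from its browser driver or mock and may use a different shape.

```python
from typing import Iterator, Optional

# Hypothetical fixture: the final page state as a role/name/children tree.
# In CI this would come from a mocked browser or a recorded snapshot.
SNAPSHOT = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {
            "role": "form",
            "name": "Shipping",
            "children": [
                {"role": "textbox", "name": "Email", "value": "a@example.com"},
                {"role": "button", "name": "Submit", "disabled": True},
            ],
        },
    ],
}


def walk(node: dict) -> Iterator[dict]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def find(tree: dict, role: str, name: str) -> Optional[dict]:
    """Return the first node matching role and accessible name, or None."""
    return next(
        (n for n in walk(tree) if n.get("role") == role and n.get("name") == name),
        None,
    )


def check_final_state(tree: dict) -> list[str]:
    """Return a list of failed expectations; an empty list means the task passed."""
    failures = []
    if find(tree, "heading", "Order confirmed") is None:
        failures.append("missing heading 'Order confirmed'")
    email = find(tree, "textbox", "Email")
    if email is None or email.get("value") != "a@example.com":
        failures.append("Email textbox missing or has the wrong value")
    return failures


failures = check_final_state(SNAPSHOT)
```

Because the check compares roles, names, and values rather than rendered pixels, it does not depend on fonts, animation timing, or viewport size.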
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:39:45.630392+00:00 — report_created — created