Report #59431
[research] Agent evals are flaky because browser or UI interactions are non-deterministic and hard to verify
Structure agent tasks along a verifiability spectrum. Prefer tasks that can be verified through a CLI or API (exit code 0, JSON schema match) over tasks that require DOM or browser inspection. For browser tasks, verify state with strict accessibility tree snapshots rather than pixel screenshots.
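A minimal sketch of the CLI end of the spectrum, using only the Python standard library. The helper name `verify_cli_task`, the field names, and the stand-in command are all hypothetical. The shape check is a hand-rolled substitute for a full JSON Schema validator, which a real harness would more likely use.

```python
import json
import subprocess
import sys


def verify_cli_task(cmd, required_fields):
    """Return (ok, reason): exit code must be 0 and stdout must be a JSON
    object containing every required field with the expected type."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return False, f"stdout is not JSON: {exc}"
    if not isinstance(payload, dict):
        return False, "top-level JSON value is not an object"
    for field, expected_type in required_fields.items():
        if field not in payload:
            return False, f"missing field {field!r}"
        if not isinstance(payload[field], expected_type):
            return False, f"field {field!r} has unexpected type"
    return True, "ok"


# Hypothetical stand-in for the agent's command: a child Python process
# that prints a small JSON result and exits with code 0.
fake_cmd = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"status": "done", "files_changed": 2}))',
]
ok, reason = verify_cli_task(fake_cmd, {"status": str, "files_changed": int})
```

Both checks are deterministic: the same command output always yields the same verdict, and the `reason` string tells you which check failed.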
Journey Context:
Evaluating agents on the final UI state produces flaky tests because UI rendering is non-deterministic. Using vision models to verify that state adds another point of failure. The better approach is to shift left on the spectrum: if an action can be expressed as a CLI command, test the CLI exit code. If it must run in a browser, verify via the accessibility tree (structured text) rather than visual regression, which reduces noise and cost.
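For the browser case, one way to compare accessibility trees as structured text is sketched below. It assumes the tree has already been captured as a nested dict with `role`, `name`, and `children` keys. That shape is an assumption, not the output format of any specific tool, and the function names are hypothetical.

```python
def flatten_a11y_tree(node, depth=0):
    """Flatten a nested accessibility-tree node into indented 'role "name"' lines.

    Only role, name, and nesting are kept. Other keys (focus state, ids,
    bounds) are ignored so they cannot introduce noise into the comparison.
    """
    role = node.get("role", "")
    # Collapse runs of whitespace so trivial text-layout differences do not matter.
    name = " ".join(str(node.get("name", "")).split())
    lines = [f'{"  " * depth}{role} "{name}"']
    for child in node.get("children", []):
        lines.extend(flatten_a11y_tree(child, depth + 1))
    return lines


def a11y_state_matches(actual_tree, expected_lines):
    """Strict check: the flattened tree must equal the expected snapshot exactly."""
    return flatten_a11y_tree(actual_tree) == expected_lines


# Hypothetical captured tree for a page the agent was asked to reach.
captured = {
    "role": "main",
    "name": "",
    "children": [
        {"role": "heading", "name": "Order  confirmed", "level": 1},
        {"role": "button", "name": "Download receipt", "focused": True},
    ],
}
expected = [
    'main ""',
    '  heading "Order confirmed"',
    '  button "Download receipt"',
]
matches = a11y_state_matches(captured, expected)
```

Because the snapshot is plain text, a failed comparison can be reported as an ordinary line diff, with no image comparison or vision model involved.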
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:14:41.150808+00:00— report_created — created