Report #62750
[research] Agent evals are flaky or unreliable in browser environments
Align your eval environment with the verifiability spectrum. Prefer CLI- or API-based tool evaluations (deterministic exit codes and stdout) over browser-based DOM evaluations (flaky, dependent on rendering). For browser tasks, verify against the accessibility tree instead of screenshots.
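A minimal sketch of what a CLI-side check could look like, assuming the tool under evaluation prints a JSON object to stdout and signals success with exit code 0. The function name `check_cli_run` and the `expected` fields are hypothetical, chosen only for illustration.

```python
import json
import subprocess
import sys
from typing import Any, Dict, List


def check_cli_run(cmd: List[str], expected: Dict[str, Any], timeout: float = 30.0) -> bool:
    """Return True only if cmd exits 0 and prints a JSON object containing `expected`.

    Hypothetical helper: the pass criteria (exit code 0 plus matching JSON
    fields) are an assumption about how the tool reports success.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A hung tool counts as a failed run rather than an eval crash.
        return False

    if proc.returncode != 0:
        return False

    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return False

    if not isinstance(payload, dict):
        return False

    # Every expected key must be present with exactly the expected value.
    return all(key in payload and payload[key] == value for key, value in expected.items())


if __name__ == "__main__":
    # Stand-in for a real tool: a Python one-liner that prints structured JSON.
    demo_cmd = [sys.executable, "-c", "import json; print(json.dumps({'status': 'ok'}))"]
    print(check_cli_run(demo_cmd, {"status": "ok"}))
```

Because the verdict depends only on the exit code and parsed stdout, repeated runs of the same command should give the same result, which is the property the recommendation is after.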
Journey Context:
Evaluating agents that interact with the real world is hard. Browser-based agents are notoriously flaky because DOM changes, load times, and UI updates break both the agent and the eval. CLI and API interactions are highly verifiable: a tool returns structured JSON or a specific exit code, so the eval can compare the result directly. When browser interaction is unavoidable, shifting from pixel-based verification to accessibility tree (AOM) verification can substantially reduce flakiness, because the check targets roles and names and abstracts away visual rendering variations.
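As a rough illustration of accessibility-tree verification, the sketch below searches a snapshot for a node by role and accessible name instead of comparing pixels. The snapshot shape (nested dicts with `role`, `name`, and `children` keys) is an assumption made for this example; how the snapshot is captured depends on the browser automation tool in use and is not shown.

```python
from typing import Any, Dict, Iterator, Optional

# Assumed snapshot shape: {"role": str, "name": str, "children": [ ... ]}
Node = Dict[str, Any]


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_node(tree: Node, role: str, name: str) -> Optional[Node]:
    """Return the first node matching role and accessible name, or None."""
    for node in iter_nodes(tree):
        if node.get("role") == role and node.get("name") == name:
            return node
    return None


def task_succeeded(tree: Node) -> bool:
    """Hypothetical pass criterion: a status region announces the saved order."""
    return find_node(tree, role="status", name="Order saved") is not None


if __name__ == "__main__":
    # Hand-written snapshot standing in for one captured from a real page.
    snapshot = {
        "role": "WebArea",
        "name": "Checkout",
        "children": [
            {"role": "button", "name": "Submit"},
            {"role": "main", "name": "", "children": [
                {"role": "status", "name": "Order saved"},
            ]},
        ],
    }
    print(task_succeeded(snapshot))
```

The check ignores layout, fonts, and colors, so a restyled page with the same roles and names would still pass, while a page that never announces the status would fail.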
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:48:27.874353+00:00— report_created — created