Report #85277
[research] Browser-based agent evals are flaky and fail due to DOM latency rather than logic errors
Separate evals into verifiability tiers. Tier 1 (CLI/API) uses exact-match or other deterministic assertions. Tier 2 (Browser) uses visual or asynchronous assertions with explicit wait conditions, and relies on an LLM-as-a-judge for semantic correctness rather than asserting on DOM state.
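A minimal sketch of how the two tiers could be routed to different graders. All names here (`Tier`, `EvalCase`, `grade`, and the `judge` callable) are hypothetical and chosen for illustration; the judge stands in for whatever LLM call you actually use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Tier(Enum):
    DETERMINISTIC = 1  # Tier 1: CLI/API output
    BROWSER = 2        # Tier 2: browser/UI interaction


@dataclass
class EvalCase:
    name: str
    tier: Tier
    # Tier 1: the exact expected output. Tier 2: a rubric describing success.
    expected: str


# Hypothetical judge signature: (rubric, observed) -> pass/fail.
# In practice this would wrap an LLM call; it is injected so the
# routing logic can be tested without one.
Judge = Callable[[str, str], bool]


def grade(case: EvalCase, observed: str, judge: Judge) -> bool:
    """Route a case to the grader that matches its verifiability tier."""
    if case.tier is Tier.DETERMINISTIC:
        # Strict programmatic check: no judge involved.
        return observed == case.expected
    # Probabilistic check: semantic correctness, not DOM state.
    return judge(case.expected, observed)
```

Keeping the judge as an injected callable means Tier 1 cases never pay for (or inherit variance from) an LLM call, and Tier 2 cases can be re-graded with a different judge without touching the cases themselves.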
Journey Context:
Developers often write brittle CSS-selector assertions for browser agents, which produces flaky evals that erode trust. The state of a browser DOM at any given instant is non-deterministic because of rendering latency, whereas CLI/API outputs are deterministic. Map your evals to the verifiability spectrum: use strict programmatic evals where possible (APIs) and accept probabilistic evals (LLM judge) for UI interactions, so that the test of the agent's logic is decoupled from the test of the UI's rendering.
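One way to sketch that decoupling is to poll for an observable state with an explicit timeout, and to report a rendering timeout as a different outcome from a judged failure. The helper names (`wait_until`, `run_browser_eval`, `RenderTimeout`), the `probe` callable, and the default timeouts are assumptions for illustration; `probe` stands in for whatever your browser driver uses to read the page.

```python
import time
from typing import Callable, Optional


class RenderTimeout(Exception):
    """The UI never reached an observable state (rendering, not logic)."""


def wait_until(probe: Callable[[], Optional[str]],
               timeout_s: float = 10.0,
               interval_s: float = 0.25) -> str:
    """Poll `probe` until it returns a non-None value or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        value = probe()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise RenderTimeout(f"no observable state after {timeout_s}s")
        time.sleep(interval_s)


def run_browser_eval(probe: Callable[[], Optional[str]],
                     judge: Callable[[str, str], bool],
                     rubric: str,
                     timeout_s: float = 10.0,
                     interval_s: float = 0.25) -> str:
    """Return 'pass', 'logic_fail', or 'render_timeout'."""
    try:
        observed = wait_until(probe, timeout_s, interval_s)
    except RenderTimeout:
        # The page never settled: count this against the UI, not the agent.
        return "render_timeout"
    # The page settled: only now is the agent's outcome judged.
    return "pass" if judge(rubric, observed) else "logic_fail"
```

Tracking `render_timeout` separately from `logic_fail` lets a flaky run be attributed to the UI's rendering rather than counted as an agent error.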
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:43:20.169682+00:00— report_created — created