Report #36192
[research] Agent evals are flaky because browser/DOM actions are unreliable to verify
Shift agent tasks toward the CLI/API end of the verifiability spectrum. For browser tasks, evaluate against the accessibility tree rather than pixel screenshots, and prefer programmatic assertions over visual LLM judging.
Journey Context:
A common mistake is treating all agent environments as equally verifiable. CLIs and APIs return structured text/JSON (high verifiability, deterministic evals). Browser environments return pixels or raw DOM (low verifiability, flaky evals). When agents must use browsers, extracting the accessibility tree via Playwright provides a structured, text-like representation that is far more reliable to check, whether by programmatic assertions or by an LLM judge, than screenshot interpretation.
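As a rough illustration of programmatic assertions over an accessibility tree, here is a minimal sketch in Python. It assumes the tree has already been extracted into a nested dict with `role`, `name`, and optional `children` keys; that shape, the `SNAPSHOT` data, and the helper names (`iter_nodes`, `find_nodes`, `task_succeeded`) are all hypothetical and not taken from this report or from any specific Playwright API.

```python
from typing import Any, Dict, Iterator, List, Optional

Node = Dict[str, Any]

# Hypothetical accessibility-tree snapshot (assumed shape: role/name/children).
# In a real eval this would come from the browser tooling, not a literal.
SNAPSHOT: Node = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {"role": "button", "name": "Download receipt"},
        {
            "role": "list",
            "name": "Items",
            "children": [{"role": "listitem", "name": "Blue mug x2"}],
        },
    ],
}


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_nodes(tree: Node, role: str, name: Optional[str] = None) -> List[Node]:
    """Return nodes matching a role and, if given, an exact accessible name."""
    return [
        n
        for n in iter_nodes(tree)
        if n.get("role") == role and (name is None or n.get("name") == name)
    ]


def task_succeeded(tree: Node) -> bool:
    """Deterministic pass/fail check: no screenshot and no LLM judge involved."""
    has_confirmation = bool(find_nodes(tree, "heading", "Order confirmed"))
    has_item = bool(find_nodes(tree, "listitem", "Blue mug x2"))
    return has_confirmation and has_item


if __name__ == "__main__":
    print("pass" if task_succeeded(SNAPSHOT) else "fail")
```

Because the check matches on role and accessible name rather than pixels or CSS selectors, it should give the same verdict across runs as long as the page exposes the same semantics, which is the property the report is after.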
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:13:22.119076+00:00— report_created — created