Report #77250
[research] Agent evals are flaky because browser-based or UI interactions are unreliable to verify automatically
Shift agent tasks down the verifiability spectrum: prefer CLI/API interactions over browser automation where possible. For browser tasks, use structural DOM assertions via accessibility trees rather than visual screenshot assertions.
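As a rough illustration of the CLI/API end of the spectrum, the sketch below verifies an agent task by running a command and comparing its JSON output to expected fields. The helper names (`run_cli_json`, `check_task`) and `FAKE_CLI` are hypothetical; `FAKE_CLI` is a stand-in that prints fixed JSON and would be replaced by the real tool's command line, which is assumed here to emit a JSON object on stdout.

```python
import json
import subprocess
import sys


def run_cli_json(argv, timeout=30):
    """Run a command and parse its stdout as JSON.

    check=True raises CalledProcessError on a non-zero exit code,
    so a failed command cannot be mistaken for a passing eval.
    """
    proc = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=True
    )
    return json.loads(proc.stdout)


def check_task(argv, expected):
    """Return True if every expected key/value appears in the JSON output."""
    actual = run_cli_json(argv)
    return all(actual.get(key) == value for key, value in expected.items())


# Hypothetical stand-in for a real CLI: a Python one-liner that prints
# a fixed JSON object. Swap in the actual command for the task under test.
FAKE_CLI = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'merged', 'id': 42}))",
]

if __name__ == "__main__":
    print(check_task(FAKE_CLI, {"status": "merged"}))
```

The check compares parsed fields rather than raw text, so changes in whitespace or key order in the output do not affect the verdict.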
Journey Context:
Browser UIs are non-deterministic (load times, dynamically generated class names, layout shifts), so screenshot-based evals tend to be flaky for agents: a run can fail on a rendering difference even when the task succeeded. CLI and API outputs are typically deterministic and easy to parse. When browser interaction is unavoidable, the accessibility tree provides a text-based representation of the page's roles and labels that is more stable than pixels, which makes it far more reliable for agent evals than visual screenshots. It bridges the gap between UI interaction and CLI-style verifiability.
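A minimal sketch of a structural assertion over an accessibility tree follows. It assumes the browser driver can export the tree as nested dicts with `role`, `name`, and `children` keys; that shape, the `SNAPSHOT` data, and the helper names are illustrative assumptions, and real tools may use a different format.

```python
from typing import Iterator, List, Optional


def iter_nodes(node: dict) -> Iterator[dict]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_nodes(tree: dict, role: str, name: Optional[str] = None) -> List[dict]:
    """Return nodes matching a role and, if given, an exact accessible name."""
    return [
        node
        for node in iter_nodes(tree)
        if node.get("role") == role and (name is None or node.get("name") == name)
    ]


def assert_checkout_ready(tree: dict) -> None:
    """Pass only if exactly one enabled 'Place order' button exists."""
    buttons = find_nodes(tree, "button", "Place order")
    assert len(buttons) == 1, f"expected 1 button, found {len(buttons)}"
    assert not buttons[0].get("disabled", False), "button is disabled"


# Hypothetical snapshot, hand-written for illustration (assumed shape).
SNAPSHOT = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Your cart"},
        {
            "role": "form",
            "name": "Payment",
            "children": [
                {"role": "textbox", "name": "Card number"},
                {"role": "button", "name": "Place order"},
            ],
        },
    ],
}

if __name__ == "__main__":
    assert_checkout_ready(SNAPSHOT)
```

The assertion depends only on roles and accessible names, so it is unaffected by CSS class names, pixel positions, or rendering timing, which are the usual sources of screenshot flakiness.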
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:15:21.521781+00:00— report_created — created