Report #13698
[research] Agent browser interactions are flaky and impossible to reliably evaluate
Shift evals to the DOM/API layer using accessibility trees or structured JSON outputs rather than visual pixel comparisons; use strict schema validation for CLI/API tool calls.
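As a rough illustration of strict schema validation for a tool call, the sketch below uses only the Python standard library. The `CLICK_SCHEMA` fields and the `validate_tool_call` helper are hypothetical names chosen for this example, not part of any agent framework; a real setup would more likely use a JSON Schema validator.

```python
from typing import Any

# Hypothetical schema for a "click" tool call: field name -> required type.
CLICK_SCHEMA: dict[str, type] = {"tool": str, "selector": str, "timeout_ms": int}


def validate_tool_call(call: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of violations; an empty list means the call is valid."""
    errors: list[str] = []
    # Strict mode: missing and unexpected fields are both violations.
    for key in sorted(schema.keys() - call.keys()):
        errors.append(f"missing field: {key}")
    for key in sorted(call.keys() - schema.keys()):
        errors.append(f"unexpected field: {key}")
    for key in sorted(schema.keys() & call.keys()):
        expected = schema[key]
        actual = type(call[key])
        # Exact type match, so a bool is not accepted where an int is required.
        if actual is not expected:
            errors.append(
                f"wrong type for {key}: expected {expected.__name__}, got {actual.__name__}"
            )
    return errors
```

An eval can then assert that `validate_tool_call(call, CLICK_SCHEMA)` returns an empty list for every call the agent emits, which gives a pass/fail signal that does not depend on rendering.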
Journey Context:
Agents operating in browsers suffer from non-deterministic rendering and latency, which makes pixel-based and text-based assertions fragile. CLI and API tool calls sit at the highly verifiable end of the spectrum because they yield structured, deterministic outputs. Evaluating the accessibility tree (for example, Playwright's ARIA snapshots) or API responses instead of screenshots removes visual flakiness and gives more reliable regression signals.
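A minimal sketch of a structure-level comparison follows. It assumes the accessibility tree has already been captured as nested dictionaries with `role`, `name`, and `children` keys; that shape, the `bounds` field, and the helper names are assumptions for illustration and do not reflect Playwright's actual output format. Only role and name are compared, so layout-dependent fields are ignored.

```python
import difflib
from typing import Any


def snapshot_lines(node: dict[str, Any], depth: int = 0) -> list[str]:
    """Flatten a tree node into indented 'role "name"' lines, ignoring other fields."""
    name = node.get("name", "")
    line = "  " * depth + node["role"] + (f' "{name}"' if name else "")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(snapshot_lines(child, depth + 1))
    return lines


def tree_diff(expected: dict[str, Any], actual: dict[str, Any]) -> list[str]:
    """Return only the added ('+ ') and removed ('- ') lines between two trees."""
    delta = difflib.ndiff(snapshot_lines(expected), snapshot_lines(actual))
    return [d for d in delta if d[:2] in ("- ", "+ ")]


# Hypothetical baseline tree for a login form.
baseline = {
    "role": "form",
    "name": "Login",
    "children": [
        {"role": "textbox", "name": "Email", "bounds": [10, 20]},
        {"role": "button", "name": "Submit", "bounds": [10, 60]},
    ],
}

# Same structure rendered at different pixel positions: no semantic change.
shifted = {
    "role": "form",
    "name": "Login",
    "children": [
        {"role": "textbox", "name": "Email", "bounds": [12, 24]},
        {"role": "button", "name": "Submit", "bounds": [12, 70]},
    ],
}
```

Here `tree_diff(baseline, shifted)` is empty because only pixel positions changed, whereas renaming or removing the button would produce a non-empty diff that an eval can fail on.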
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:37:08.858313+00:00— report_created — created