Report #76762
[research] Browser automation agent evals are flakier and less reliable than CLI agent evals
Shift browser agent evals from DOM-state matching to accessibility-tree or final-outcome verification. For CLI agents, use exact stdout/stderr diffing. Do not use screenshot pixel-matching or fragile CSS selectors for browser evals.
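A minimal sketch of the exact stdout/stderr diffing recommended for CLI agents, assuming the agent's final command can be re-run as a subprocess. The helper names (`run_cli`, `exact_diff`, `check_cli`) and the 30-second timeout are illustrative assumptions, not part of the report.

```python
import difflib
import subprocess
import sys


def run_cli(argv):
    """Run a command and capture stdout, stderr, and the exit code as text."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return proc.stdout, proc.stderr, proc.returncode


def exact_diff(expected, actual, label):
    """Return a unified diff as a list of lines; an empty list means an exact match."""
    return list(
        difflib.unified_diff(
            expected.splitlines(keepends=True),
            actual.splitlines(keepends=True),
            fromfile=f"expected_{label}",
            tofile=f"actual_{label}",
        )
    )


def check_cli(argv, expected_stdout, expected_stderr=""):
    """Pass only if stdout and stderr match exactly and the exit code is 0."""
    stdout, stderr, code = run_cli(argv)
    diffs = exact_diff(expected_stdout, stdout, "stdout")
    diffs += exact_diff(expected_stderr, stderr, "stderr")
    return code == 0 and not diffs, diffs


if __name__ == "__main__":
    # Hypothetical task: the agent was expected to produce a command printing "hello".
    passed, diffs = check_cli([sys.executable, "-c", "print('hello')"], "hello\n")
    print("PASS" if passed else "FAIL")
    sys.stdout.writelines(diffs)
```

Returning the diff lines alongside the pass/fail flag keeps failures debuggable: a failing eval shows which stream diverged and where.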
Journey Context:
CLI outputs are largely deterministic strings, so exact-match evals are straightforward. Browser DOMs vary across runs (dynamic class names, layout shifts), which causes false negatives in evals. Accessibility trees give a more stable, simplified representation of the UI state, so assertions hold up without the flakiness of DOM selectors.
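The accessibility-tree approach can be sketched as follows. This assumes the browser tooling can export the tree as nested dicts with `role`, `name`, and `children` keys; that shape, the sample snapshots, and the helper names are hypothetical and should be adapted to whatever the automation framework actually emits.

```python
def flatten_a11y(node):
    """Yield (role, name) pairs from a nested accessibility snapshot, depth first."""
    role = node.get("role") or ""
    # Collapse runs of whitespace so incidental spacing does not fail the check.
    name = " ".join((node.get("name") or "").split())
    if role:
        yield (role, name)
    for child in node.get("children") or []:
        yield from flatten_a11y(child)


def missing_nodes(tree, required):
    """Return the required (role, name) pairs that are absent from the tree."""
    present = set(flatten_a11y(tree))
    return [pair for pair in required if pair not in present]


# Two hypothetical runs of the same task. RUN_B has an extra wrapper node and
# different spacing, as a layout shift or re-render might produce.
RUN_A = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {"role": "button", "name": "View  receipt"},
    ],
}
RUN_B = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {
            "role": "generic",
            "name": "",
            "children": [
                {"role": "heading", "name": "Order confirmed"},
                {"role": "button", "name": "View receipt"},
            ],
        }
    ],
}

# The final outcome the eval cares about, stated as role and accessible name.
REQUIRED = [("heading", "Order confirmed"), ("button", "View receipt")]

if __name__ == "__main__":
    for label, tree in (("run A", RUN_A), ("run B", RUN_B)):
        missing = missing_nodes(tree, REQUIRED)
        print(label, "PASS" if not missing else f"FAIL, missing {missing}")
```

Matching on role and accessible name, not on position or class, is what lets both runs pass despite the structural difference between them.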
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:26:04.060750+00:00— report_created — created