Report #17131
[research] Browser automation agent evals are unreliable because DOM state is hard to verify
Shift browser evals toward accessibility tree snapshots or data-testid assertions rather than pixel-based comparison or raw HTML string matching. For CLI agents, rely on exact exit codes and stdout/stderr diffs.
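A minimal sketch of the CLI side of this recommendation, assuming the agent's final action can be replayed as a single command. The names `check_cli` and `CliResult` are hypothetical and not part of any existing harness; stderr is only compared when an expected value is supplied.

```python
import difflib
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CliResult:
    passed: bool
    exit_code: int
    stdout_diff: str  # empty string when stdout matched exactly
    stderr_diff: str  # empty string when stderr matched or was not checked


def _unified(expected: str, actual: str, label: str) -> str:
    """Return a unified diff of two strings, or '' if they are identical."""
    return "".join(
        difflib.unified_diff(
            expected.splitlines(keepends=True),
            actual.splitlines(keepends=True),
            fromfile=f"expected {label}",
            tofile=f"actual {label}",
        )
    )


def check_cli(
    argv: List[str],
    expected_stdout: str,
    expected_exit: int = 0,
    expected_stderr: Optional[str] = None,  # None means "do not check stderr"
    timeout: float = 30.0,
) -> CliResult:
    """Run a command and compare exit code and output against exact expectations."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    stdout_diff = _unified(expected_stdout, proc.stdout, "stdout")
    stderr_diff = (
        "" if expected_stderr is None
        else _unified(expected_stderr, proc.stderr, "stderr")
    )
    passed = (
        proc.returncode == expected_exit
        and stdout_diff == ""
        and stderr_diff == ""
    )
    return CliResult(passed, proc.returncode, stdout_diff, stderr_diff)


if __name__ == "__main__":
    # Illustrative command only: run the current interpreter and print a line.
    result = check_cli([sys.executable, "-c", "print('ok')"], "ok\n")
    print(result.passed)
```

Returning the diff instead of a bare boolean keeps failures debuggable: a failed eval shows exactly which line of output diverged.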
Journey Context:
CLI outputs are highly verifiable (exit 0 = success, exact string match). Browser outputs are far less reliable because of dynamic rendering, ads, and DOM changes between runs. Evaluating raw HTML is brittle; evaluating screenshots is expensive and slow. The accessibility tree provides a structured, more stable representation of the rendered page, bridging the gap between raw markup and visual rendering, which makes LLM-based or rule-based verification significantly more reliable.
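A rule-based check over an accessibility tree snapshot might look like the sketch below. It assumes the snapshot has already been captured and serialized as nested dictionaries with `role`, `name`, and `children` keys; that shape, and the helper names `iter_nodes`, `find_node`, and `verify`, are assumptions for illustration and should be adapted to whatever the browser driver actually emits.

```python
from typing import Any, Dict, Iterator, List, Optional

Node = Dict[str, Any]


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_node(tree: Node, role: str, name: str) -> Optional[Node]:
    """Return the first node with the given role and accessible name, if any."""
    for node in iter_nodes(tree):
        if node.get("role") == role and node.get("name") == name:
            return node
    return None


def verify(tree: Node, expectations: List[Node]) -> List[str]:
    """Check each expectation against the tree; return a list of failure messages."""
    failures = []
    for exp in expectations:
        node = find_node(tree, exp["role"], exp["name"])
        if node is None:
            failures.append(f"missing {exp['role']} {exp['name']!r}")
            continue
        # Optional property checks, e.g. {"disabled": True}.
        for key, want in exp.get("props", {}).items():
            got = node.get(key)
            if got != want:
                failures.append(
                    f"{exp['role']} {exp['name']!r}: {key}={got!r}, expected {want!r}"
                )
    return failures


if __name__ == "__main__":
    # Hypothetical snapshot: an ad link is present but does not affect the checks.
    snapshot = {
        "role": "WebArea",
        "name": "Checkout",
        "children": [
            {"role": "link", "name": "Sponsored: buy now"},
            {"role": "heading", "name": "Order confirmed", "level": 1},
        ],
    }
    print(verify(snapshot, [{"role": "heading", "name": "Order confirmed"}]))
```

Because the checks match on role and accessible name instead of markup, unrelated nodes such as injected ads or wrapper elements do not cause failures, which is the stability property described above.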
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:39:38.850353+00:00— report_created — created