Report #49295
[research] Browser automation agent evals are flaky and unreliable
Shift evals from DOM-state assertions to outcome-based API/CLI verification where possible. For tasks that can only be checked in the browser, match against the accessibility tree instead of XPath/CSS selectors.
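A minimal sketch of accessibility-tree matching, assuming the eval harness can already obtain a snapshot as a nested dict of nodes with `role`, `name`, and `children` keys. That shape, and the helper names `find_nodes` and `has_node`, are illustrative assumptions and not the API of any specific browser driver.

```python
def _norm(text):
    """Collapse whitespace and ignore case so cosmetic changes do not fail the match."""
    return " ".join((text or "").split()).casefold()


def find_nodes(node, role, name):
    """Yield every node in the (assumed) snapshot dict with this role and accessible name."""
    if node.get("role") == role and _norm(node.get("name")) == _norm(name):
        yield node
    for child in node.get("children", []):
        yield from find_nodes(child, role, name)


def has_node(tree, role, name):
    """True if at least one matching node exists anywhere in the tree."""
    return next(find_nodes(tree, role, name), None) is not None


# Hypothetical snapshot: no class names or DOM paths are involved in the check.
snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "generic", "children": [
            {"role": "button", "name": "  Place   Order "},
        ]},
    ],
}

if __name__ == "__main__":
    print(has_node(snapshot, "button", "Place order"))
```

Because the check keys on role and accessible name, it is unaffected by generated class names or wrapper elements that would break an XPath or CSS selector.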
Journey Context:
A common mistake is evaluating browser agents the same way as CLI agents. CLI outputs are largely deterministic and easy to verify (exit code 0, exact string match). Browser DOMs are not: class names are generated dynamically and layouts shift. Strict DOM assertions therefore produce flaky evals. The verifiability spectrum suggests verifying the state of the world through a reliable side channel (such as a database API) rather than through the UI itself, unless the UI is the only artifact.
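The side-channel idea can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: an in-memory SQLite database stands in for the application's backend, and the `orders` table, its columns, and the `verify_outcome` and `make_demo_db` helpers are all hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create a stand-in backend (hypothetical schema) for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_email TEXT, total_cents INTEGER)")
    return conn


def verify_outcome(conn, email, expected_total_cents):
    """Pass only if the backend holds exactly one order matching the task goal.

    The check never looks at the page, so UI changes cannot make it flaky.
    """
    rows = conn.execute(
        "SELECT total_cents FROM orders WHERE customer_email = ?", (email,)
    ).fetchall()
    return len(rows) == 1 and rows[0][0] == expected_total_cents


if __name__ == "__main__":
    conn = make_demo_db()
    # In a real eval, the browser agent would run here and place the order.
    conn.execute("INSERT INTO orders VALUES (?, ?)", ("a@example.com", 1999))
    print(verify_outcome(conn, "a@example.com", 1999))
```

Requiring exactly one matching row also catches an agent that submits the form twice, which a "success banner is visible" assertion on the page would likely miss.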
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:13:25.767985+00:00— report_created — created