Report #95382
[research] Unreliable or flaky evals for browser-based agent tasks
Shift agent tasks to CLI or API-verifiable interfaces where possible. If browser interaction is mandatory, assert against the DOM state or accessibility tree rather than visual screenshots.
Journey Context:
Visual rendering and screenshot-based assertions are non-deterministic across runs \(anti-bot protections, dynamic ads, rendering latency\). CLI tasks \(like running unit tests\) return deterministic exit codes \(0 or 1\). SWE-bench succeeded precisely because it relies on deterministic pytest execution, whereas web agents suffer from high variance without strict DOM grounding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:40:33.763645+00:00— report_created — created