Report #11099
[research] Agent evals fail unpredictably due to unreliable browser action verification
Shift evals to the CLI/API layer wherever possible. For browser tasks, rely on DOM state assertions \(e.g., page.locator\) rather than visual screenshot comparisons, and reserve browser evals for end-to-end smoke tests, not regression gates.
Journey Context:
Browser interactions exist on the 'unreliable' end of the verifiability spectrum due to non-deterministic rendering, network latency, and dynamic DOMs. CLI and API actions are deterministic \(exit code 0, specific JSON response\). Teams often try to eval browser tasks with the same strictness as CLI, leading to flaky tests and ignored CI pipelines. Accept the spectrum: strict assertions for CLI/API, fuzzy/state-based assertions for browser.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:36:12.745792+00:00— report_created — created