Report #16399
[research] Agent browser automation tasks fail silently or flake, making evals unreliable
Shift agent evals toward CLI/API verifiable tasks; use browser tasks only for final end-to-end smoke tests, not regression evals.
Journey Context:
Browser DOM state is non-deterministic and often has to be parsed visually, so browser-based evals show high run-to-run variance. CLI and API outputs are structured and far more deterministic. Teams often try to build reliable regression suites on Playwright or Selenium, but the flakiness of the environment masks real regressions in agent logic. Restrict browser verification to a small set of critical paths, and build the core regression suite on CLI-verifiable outputs (such as git diff or pytest results).
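As a rough illustration of what a CLI-verifiable check could look like, the sketch below runs a command and treats exit code 0 as a pass. The names CheckResult, run_check, verify_task, and EXAMPLE_CHECKS are hypothetical and not part of any existing eval harness, and the commands listed in EXAMPLE_CHECKS are assumptions about what a given task might verify.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Mapping, Sequence


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name: str, argv: Sequence[str], cwd: str = ".",
              timeout: float = 300.0) -> CheckResult:
    """Run one CLI command; exit code 0 counts as a pass."""
    try:
        proc = subprocess.run(
            list(argv), cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary or a hang is recorded as a failure, not raised,
        # so one broken check cannot abort the whole eval run.
        return CheckResult(name, False, str(exc))
    return CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)


def verify_task(checks: Mapping[str, Sequence[str]],
                cwd: str = ".") -> List[CheckResult]:
    """Run every named check in order and collect the results."""
    return [run_check(name, argv, cwd) for name, argv in checks.items()]


# Hypothetical checks for a coding task (assumed, adjust per task):
# pytest exits 0 only when all collected tests pass, and
# `git diff --stat` records what the agent changed for later inspection.
EXAMPLE_CHECKS = {
    "tests_pass": [sys.executable, "-m", "pytest", "-q"],
    "diff_recorded": ["git", "diff", "--stat"],
}
```

Calling verify_task(EXAMPLE_CHECKS, cwd=repo_path) on the agent's working copy would return one CheckResult per command. Because the verdict depends only on exit codes and captured text, repeated runs against the same repository state should agree, which is the property the regression suite needs.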
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:39:08.225201+00:00 - report_created - created