Report #9955
[research] Agent browser automation evals are flaky and unreliable
Shift agent tasks from browser/GUI interactions to CLI/API interfaces wherever possible. Treat browser automation as a fragile fallback only, and rely on structured API outputs for verifiable evals.
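One way to express the "API first, browser as fallback" ordering is a small dispatcher that tries routes in priority order and records which interface actually produced the result. This is a minimal sketch under assumptions: `Route`, `run_task`, and the route labels are hypothetical names for illustration, and each route is assumed to return structured data or `None` when it cannot handle the task.

```python
from typing import Callable, Optional

# Hypothetical convention: a route returns a dict on success, None if unavailable.
Route = Callable[[], Optional[dict]]


def run_task(routes: list[tuple[str, Route]]) -> tuple[str, dict]:
    """Try routes in priority order; return the interface name and its result."""
    for name, route in routes:
        result = route()
        if result is not None:
            return name, result
    raise RuntimeError("no route could complete the task")


# Usage: the API route is listed first, so the browser route only runs
# when the API route reports that it is unavailable.
used, data = run_task([
    ("api", lambda: None),                   # stand-in for a missing API
    ("browser", lambda: {"title": "Home"}),  # stand-in for a GUI scrape
])
```

Returning the interface name alongside the data lets the eval harness see when the fallback was taken, rather than hiding it inside the agent.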
Journey Context:
Browser interactions suffer from non-deterministic DOM changes, variable load times, and layout shifts, which makes evals brittle. CLI and API interactions return structured, deterministic data (JSON, exit codes) that can be strictly validated. The tradeoff is that some tasks genuinely require a GUI, but the eval suite should heavily penalize GUI reliance where an API exists, pushing tasks toward the deterministic end of the verifiability spectrum.
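The two ideas above, strict validation of CLI output and a penalty for avoidable GUI use, could be sketched roughly as follows. Everything here is an assumption for illustration: `StepResult`, `check_cli_step`, `score`, the interface labels, and the `gui_penalty` default of 0.5 are hypothetical, not part of any existing eval framework.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    interface: str       # hypothetical labels: "cli", "api", or "gui"
    passed: bool
    api_available: bool  # True if a CLI/API route existed for this step


def check_cli_step(cmd: list[str], expected: dict) -> bool:
    """Pass only if the command exits with 0 and its stdout is the expected JSON."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False
    try:
        return json.loads(proc.stdout) == expected
    except json.JSONDecodeError:
        return False


def score(steps: list[StepResult], gui_penalty: float = 0.5) -> float:
    """Mean step score; a passing GUI step loses gui_penalty if an API existed."""
    if not steps:
        return 0.0
    total = 0.0
    for step in steps:
        value = 1.0 if step.passed else 0.0
        if step.passed and step.interface == "gui" and step.api_available:
            value -= gui_penalty
        total += value
    return total / len(steps)


# Usage: a stand-in "CLI" that prints JSON, run via the current interpreter.
cmd = [sys.executable, "-c", "import json; print(json.dumps({'status': 'ok'}))"]
cli_ok = check_cli_step(cmd, {"status": "ok"})
```

The check is binary on purpose: exit code and parsed JSON either match or they do not, which is the property that DOM-based assertions lack. GUI steps with no API alternative are not penalized, which keeps the tradeoff described above.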
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:35:07.297124+00:00— report_created — created