Report #23131
[research] Agent browser automation tests are extremely flaky and fail non-deterministically in CI
Map each task to its place on the verifiability spectrum. Use strict execution-based evals (exact match, exit codes) for CLI/API tasks, and fuzzy accessibility-tree matching or visual diffs for browser tasks. Never use exact DOM string matching for browser agents.
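A minimal sketch of the two ends of that spectrum, using only the Python standard library. The function names, the `(role, name)` tuple representation of an accessibility tree, and the 0.8 similarity threshold are all assumptions made for illustration; a real harness would take its snapshot from whatever browser driver it uses and tune the threshold against known-good runs.

```python
import subprocess
import sys
from difflib import SequenceMatcher


def check_cli(cmd, expected_stdout, expected_exit=0):
    """Strict check: exit code and stdout must both match exactly."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return (
        result.returncode == expected_exit
        and result.stdout.strip() == expected_stdout
    )


def _normalize(name):
    # Collapse whitespace and case so cosmetic differences do not count.
    return " ".join(name.lower().split())


def check_a11y(observed, expected, threshold=0.8):
    """Fuzzy check: every expected (role, name) node needs a close match.

    `observed` and `expected` are lists of (role, name) tuples, a simplified
    stand-in (assumption) for a real accessibility-tree snapshot.
    """
    for role, name in expected:
        # Roles must match exactly; only the accessible name is compared fuzzily.
        candidates = [n for r, n in observed if r == role]
        best = max(
            (
                SequenceMatcher(None, _normalize(name), _normalize(c)).ratio()
                for c in candidates
            ),
            default=0.0,
        )
        if best < threshold:
            return False
    return True
```

For example, `check_cli([sys.executable, "-c", "print('ok')"], "ok")` applies the strict rule, while `check_a11y(snapshot, [("button", "submit order")])` tolerates case and spacing changes in the button label but still fails if no button with a similar name exists.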
Journey Context:
Developers often apply CLI-style exact matching to browser agents. Browser DOMs are highly dynamic (load latency, generated class names, A/B tests), so exact matching produces many false negatives: the agent completes the task, but the check still fails. Recognizing that verifiability differs by environment lets you apply the right tolerance. CLI output is largely deterministic; browser state varies from run to run and needs structural or visual tolerance.
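The false-negative problem can be illustrated with two DOM snippets that differ only in a generated class name and whitespace. The sketch below is an assumption-laden example, not a recommended production matcher: the `skeleton` helper and the list of "stable" attributes are hypothetical, and it uses only the standard-library `html.parser`.

```python
from html.parser import HTMLParser


class _Skeleton(HTMLParser):
    """Collects tags with stable attributes plus visible text.

    Volatile attributes such as generated class names are dropped.
    """

    # Hypothetical choice of attributes treated as stable across runs.
    STABLE_ATTRS = ("role", "aria-label", "type", "name")

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        kept = tuple(sorted((k, v) for k, v in attrs if k in self.STABLE_ATTRS))
        self.items.append(("tag", tag, kept))

    def handle_data(self, data):
        # Normalize whitespace so layout-only changes are ignored.
        text = " ".join(data.split())
        if text:
            self.items.append(("text", text))


def skeleton(html):
    parser = _Skeleton()
    parser.feed(html)
    parser.close()
    return parser.items


# Same button rendered in two runs; only the generated class and spacing differ.
run_a = '<button class="btn-x9f2a" type="submit">Place order</button>'
run_b = '<button class="btn-k31qz" type="submit">Place  order</button>'

exact_match = run_a == run_b                            # strict string comparison
structural_match = skeleton(run_a) == skeleton(run_b)   # tolerant comparison
```

Here the strict comparison reports a failure even though the page is functionally identical, while the structural comparison treats the two runs as equivalent.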
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:14:07.838058+00:00— report_created — created