Report #27027
[research] Agent evals are flaky and unreliable when testing UI or browser automation tasks
Map each task to a point on the verifiability spectrum. Prefer tasks whose outcome can be checked through a CLI or API (git diff, exit codes, API state) over tasks that need DOM or visual assertions. Where a browser task is unavoidable, evaluate it against accessibility tree snapshots rather than pixel comparisons or XPath selectors.
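As a rough illustration of the CLI end of the spectrum, the sketch below grades a task purely on a command's exit code and, optionally, its stdout. The names `CheckResult` and `verify_command` are hypothetical helpers invented for this example, not part of any eval framework.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str


def verify_command(argv, expected_exit=0, expected_stdout=None, cwd=None, timeout=30):
    """Run a command and grade it on exit code and, optionally, exact stdout.

    Hypothetical helper: an eval harness would call this after the agent
    finishes, instead of inspecting rendered UI.
    """
    try:
        proc = subprocess.run(
            argv, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return CheckResult(False, f"command not found: {argv[0]}")
    except subprocess.TimeoutExpired:
        return CheckResult(False, f"timed out after {timeout}s")

    if proc.returncode != expected_exit:
        return CheckResult(
            False, f"exit code {proc.returncode}, expected {expected_exit}"
        )
    # Compare stdout only when the caller supplied an expectation.
    if expected_stdout is not None and proc.stdout.strip() != expected_stdout.strip():
        return CheckResult(False, "stdout mismatch")
    return CheckResult(True, "ok")


if __name__ == "__main__":
    # Stand-in for a real task check; uses the current interpreter as the CLI.
    print(verify_command([sys.executable, "-c", "print('done')"], expected_stdout="done"))
```

In a real eval the same helper could wrap something like `git diff --name-only` run inside the agent's working copy, with the expected file list as `expected_stdout`. The repository path and expected files would be specific to each task.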
Journey Context:
Agents often fail browser tasks because of minor rendering changes, dynamic content, or timing issues, and each such failure shows up in the eval as a false negative. CLI and API outputs are far more deterministic and can be diffed directly. Shifting agent architectures toward CLI/API-first workflows where possible, and using accessibility trees (which strip visual noise and so reduce flakiness) for the browser tasks that remain, raises the signal-to-noise ratio of the eval.
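For the browser tasks that remain, a minimal sketch of an accessibility-tree comparison is shown below. It assumes the snapshot has already been captured and is a nested dict with `role`, `name`, and `children` keys; that shape, and the helper names `normalize` and `contains`, are assumptions for illustration. How the snapshot is obtained depends on the browser tooling in use.

```python
def normalize(node, keep=("role", "name")):
    """Reduce a snapshot node to its semantic fields, recursively.

    Volatile attributes (ids, bounds, styles) are dropped, and runs of
    whitespace in kept values are collapsed to single spaces.
    """
    out = {k: " ".join(str(node[k]).split()) for k in keep if node.get(k)}
    children = [normalize(child, keep) for child in node.get("children", [])]
    if children:
        out["children"] = children
    return out


def contains(tree, role, name):
    """Return True if any node in a normalized tree has this role and name."""
    if tree.get("role") == role and tree.get("name") == name:
        return True
    return any(contains(child, role, name) for child in tree.get("children", []))


if __name__ == "__main__":
    # Two hypothetical snapshots of the same page from different runs:
    # ids, bounds, and spacing differ, but the semantic content does not.
    run_a = {
        "role": "WebArea", "name": "Checkout", "id": 17,
        "children": [{"role": "button", "name": "Place  order", "bounds": [0, 0, 10, 10]}],
    }
    run_b = {
        "role": "WebArea", "name": "Checkout", "id": 92,
        "children": [{"role": "button", "name": "Place order", "bounds": [5, 5, 10, 10]}],
    }
    print(normalize(run_a) == normalize(run_b))
    print(contains(normalize(run_a), "button", "Place order"))
```

Asserting on the presence of a role and name pair, rather than on full tree equality, tends to be the less brittle choice when the page contains dynamic content outside the part of the UI under test.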
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:45:52.071996+00:00— report_created — created