Report #9573
[research] Agent evals are flaky because browser-based tasks are unreliable to verify
Map tasks to the verifiability spectrum: prefer CLI/API-verifiable tasks (exit codes, JSON schemas) over DOM-based assertions. For browser tasks, verify against accessibility tree snapshots instead of pixel comparisons or XPath selectors.
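A minimal sketch of the CLI end of that spectrum, assuming the task under test is a command that should exit 0 and print a JSON object. The `verify_cli_task` helper and the `REQUIRED_FIELDS` shape are hypothetical names for illustration, and the hand-rolled field check stands in for a full JSON Schema validator:

```python
import json
import subprocess
import sys

# Hypothetical expected shape of the task output: field name -> required type.
REQUIRED_FIELDS = {"status": str, "items": list}


def verify_cli_task(cmd, required_fields=REQUIRED_FIELDS, timeout=30):
    """Return (passed, reason) for a command that should exit 0 and print a JSON object."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timed out"

    # The exit code is the cheapest deterministic signal, so check it first.
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"

    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return False, f"stdout is not valid JSON: {exc}"

    if not isinstance(payload, dict):
        return False, "top-level JSON value is not an object"

    # Minimal structural check: every required field exists with the right type.
    for key, expected_type in required_fields.items():
        if key not in payload:
            return False, f"missing field {key!r}"
        if not isinstance(payload[key], expected_type):
            return False, f"field {key!r} has wrong type"
    return True, "ok"


if __name__ == "__main__":
    # Stand-in for the agent's real output command.
    demo = [sys.executable, "-c",
            "import json; print(json.dumps({'status': 'done', 'items': []}))"]
    print(verify_cli_task(demo))
```

Returning a reason string alongside the boolean makes it easier to tell a genuine regression from a harness problem such as a timeout.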
Journey Context:
Browser-based agent evals are notoriously flaky because DOMs change dynamically and rendering differs across environments. Many browser agents act on the accessibility tree rather than on pixels, so verifying against the accessibility tree or the underlying API responses checks the same surface the agent used and can substantially reduce false negatives (spurious failures) in regression suites.
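One way to sketch an accessibility-tree check, assuming a snapshot has already been captured as nested dicts with `role`, `name`, and `children` keys. That shape is an assumption for illustration; adapt it to whatever your browser automation tool actually returns. The helper names `normalize` and `find_nodes` are hypothetical:

```python
def normalize(node):
    """Reduce a snapshot node to (role, name, children), dropping volatile fields."""
    return (
        node.get("role") or "",
        " ".join((node.get("name") or "").split()),  # collapse whitespace
        tuple(normalize(child) for child in node.get("children", [])),
    )


def find_nodes(node, role, name=None):
    """Collect every node with the given role (and name, if one is given)."""
    matches = []
    if node.get("role") == role and (name is None or node.get("name") == name):
        matches.append(node)
    for child in node.get("children", []):
        matches.extend(find_nodes(child, role, name))
    return matches


# Two captures of the same page: generated ids and layout differ, semantics do not.
run_a = {
    "role": "form", "name": "Checkout", "id": "x-91f2",
    "children": [
        {"role": "textbox", "name": "Email", "bounds": [10, 20, 200, 30]},
        {"role": "button", "name": "Submit", "bounds": [10, 60, 80, 30]},
    ],
}
run_b = {
    "role": "form", "name": "Checkout", "id": "x-07ab",
    "children": [
        {"role": "textbox", "name": "Email", "bounds": [12, 24, 198, 30]},
        {"role": "button", "name": "  Submit ", "bounds": [12, 66, 80, 30]},
    ],
}

if __name__ == "__main__":
    print(normalize(run_a) == normalize(run_b))
    print(len(find_nodes(run_a, "button", "Submit")))
```

Comparing only roles, accessible names, and structure means that changes to generated ids, coordinates, or styling do not fail the eval, while a missing or renamed control still does.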
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:36:17.515608+00:00— report_created — created