Report #60733
[research] Agent evals are flaky when verifying UI or browser interactions
Shift agent tasks toward the CLI-verifiable end of the spectrum where possible; for browser tasks, use accessibility tree snapshots instead of pixel-based screenshot comparisons for evals.
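A minimal sketch of what a CLI-verifiable check could look like, assuming the agent's task ends in a command whose exit code and stdout can be inspected. The helper name `check_cli_task` and the sample commands are hypothetical and stand in for whatever the agent actually runs.

```python
import subprocess
import sys


def check_cli_task(cmd, expected_stdout, timeout=10.0):
    """Return True if the command exits with code 0 and prints the expected text."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0 and result.stdout.strip() == expected_stdout


# Hypothetical stand-ins for an agent-produced command.
passing = [sys.executable, "-c", "print('done')"]
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

print(check_cli_task(passing, "done"))  # True
print(check_cli_task(failing, "done"))  # False
```

The pass/fail decision depends only on the exit code and the captured output, so repeated runs of the same command give the same verdict.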
Journey Context:
Agents that interact with CLIs or APIs produce exit codes and structured stdout/stderr that an eval can check exactly, which makes those evals highly reliable. Browser interactions are verified against DOM or visual state, which can vary from run to run, so the resulting evals are flaky. When you must test browser agents, comparing raw HTML/DOM is brittle because of dynamically generated class names. Accessibility trees provide a more stable, text-based representation of the UI state (roles and accessible names rather than markup), which narrows the verifiability gap.
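The accessibility-tree comparison described above can be sketched as follows. This is an illustration under assumptions, not a specific tool's API: the snapshot is assumed to be a nested dict with `role`, `name`, and `children` keys, and the helpers `flatten_a11y` and `snapshots_match` are hypothetical. Obtaining the snapshot from a real browser is left out.

```python
def flatten_a11y(node, depth=0):
    """Flatten a nested accessibility node into indented 'role "name"' lines.

    Only role and name are kept; any other keys (styling, ids) are ignored.
    """
    role = node.get("role", "")
    # Collapse runs of whitespace so layout-driven spacing does not matter.
    name = " ".join(str(node.get("name", "")).split())
    line = "  " * depth + role + (f' "{name}"' if name else "")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(flatten_a11y(child, depth + 1))
    return lines


def snapshots_match(actual, expected):
    """Compare two snapshots by their flattened role/name structure."""
    return flatten_a11y(actual) == flatten_a11y(expected)


# Assumed snapshot shape; the "class" key mimics a dynamic class name.
expected = {"role": "form", "name": "Login", "children": [
    {"role": "textbox", "name": "Email"},
    {"role": "button", "name": "Submit"},
]}
actual = {"role": "form", "name": "Login", "class": "css-x9f2", "children": [
    {"role": "textbox", "name": "Email", "class": "css-a1b2"},
    {"role": "button", "name": "  Submit ", "class": "css-c3d4"},
]}

print("\n".join(flatten_a11y(actual)))
print(snapshots_match(actual, expected))  # True
```

Because the comparison keeps only roles and names, a change in generated class names or spacing does not fail the eval, while a missing or renamed control does.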
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:25:40.168524+00:00— report_created — created