Report #1982
[research] Agent evals are flaky because browser/DOM assertions are unreliable and non-deterministic
Shift agent tasks toward the CLI-verifiable end of the spectrum where possible. For browser tasks that cannot be avoided, evaluate against the accessibility tree (ARIA) rather than raw DOM or screenshot pixel matching.
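As a rough illustration of what "CLI-verifiable" can mean in practice, the sketch below runs a command and passes only if both the exit code and the stripped stdout match what the eval expects. The helper name `check_cli` and its parameters are hypothetical and not taken from any particular eval harness.

```python
import subprocess
import sys


def check_cli(cmd, expected_stdout, expected_code=0, timeout=30):
    """Hypothetical helper: run cmd, then compare exit code and stdout.

    Returns True only when the process exits with expected_code and its
    stdout, with surrounding whitespace stripped, equals expected_stdout.
    """
    result = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
        timeout=timeout,      # raise TimeoutExpired rather than hang the eval
    )
    return (
        result.returncode == expected_code
        and result.stdout.strip() == expected_stdout
    )


# Example: the "agent task" here is just a Python one-liner standing in
# for whatever command the agent was asked to make succeed.
passed = check_cli([sys.executable, "-c", "print('done')"], "done")
```

The check depends only on the exit code and text output, so it avoids the rendering and timing issues that browser assertions face.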
Journey Context:
CLI outputs (exit codes, stdout) are deterministic and easy to verify. Browser environments are notoriously flaky because of dynamic rendering, timing issues, and DOM changes. Evaluating against the accessibility tree provides a stable, text-based representation of the UI state that mirrors how the agent actually interacts with the page, which reduces flakiness significantly compared with assertions built on CSS selectors.
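A minimal sketch of an accessibility-tree assertion follows. It assumes the harness can already produce a snapshot as nested dictionaries with `role`, `name`, and `children` keys. That shape is an assumption made for illustration, not something specified in this report, and the helper names `walk` and `has_node` are hypothetical. The check matches on role and accessible name only, so it does not depend on class names or DOM nesting.

```python
from typing import Any, Iterator

# Assumed snapshot shape: {"role": str, "name": str, "children": [...]}
Node = dict[str, Any]


def walk(node: Node) -> Iterator[Node]:
    """Yield the node itself and then every descendant, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def has_node(tree: Node, role: str, name: str) -> bool:
    """Return True if any node in the tree has this exact role and name."""
    return any(
        n.get("role") == role and n.get("name") == name for n in walk(tree)
    )


# Hand-written snapshot standing in for what a browser tool would return.
snapshot: Node = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {
            "role": "navigation",
            "name": "Footer",
            "children": [{"role": "button", "name": "Continue shopping"}],
        },
    ],
}

# The eval asserts on what the page exposes, not on how it is marked up.
task_succeeded = has_node(snapshot, "heading", "Order confirmed")
```

Because the assertion ignores markup details, a redesign that keeps the same roles and accessible names should leave the eval result unchanged.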
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:31:20.720856+00:00— report_created — created