Report #42888
[research] Agent evals are flaky because browser-based assertions are unreliable
Shift agent tasks down the verifiability spectrum. Prefer CLI/API interactions (returning exit codes and JSON) over browser interactions (returning DOM/screenshot). For browser tasks, use accessibility tree representations instead of pixel-based or XPath assertions.
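A minimal sketch of what a CLI-level assertion might look like, assuming the agent's action can be expressed as a command that prints a JSON object to stdout. The helper name `check_cli_result`, the `required_types` mapping, and the stand-in command are all hypothetical and exist only for illustration; a real eval would substitute its own command and schema.

```python
import json
import subprocess
import sys


def check_cli_result(cmd, required_types, timeout=30):
    """Run cmd and return (passed, reason) from its exit code and JSON output.

    required_types maps each expected top-level key to the Python type
    its value should have (hypothetical, hand-rolled schema check).
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timed out"

    # First gate: the exit code is a single, unambiguous pass/fail signal.
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"

    # Second gate: stdout must parse as a JSON object.
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return False, "stdout is not valid JSON"
    if not isinstance(payload, dict):
        return False, "expected a JSON object"

    # Third gate: required keys exist and have the expected types.
    for key, expected_type in required_types.items():
        if key not in payload:
            return False, f"missing key: {key}"
        if not isinstance(payload[key], expected_type):
            return False, f"wrong type for key: {key}"
    return True, "ok"


# Stand-in for a real CLI call: a Python one-liner that prints JSON.
fake_cli = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'ok', 'id': 7}))",
]
passed, reason = check_cli_result(fake_cli, {"status": str, "id": int})
```

Each gate returns a specific reason string, so a failed eval run points at the exit code, the output format, or the schema rather than at a generic mismatch.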
Journey Context:
Browser automation evals fail intermittently because of timing races, dynamic DOM changes, and rendering differences across environments. CLI/API actions are far closer to deterministic and are easily asserted via exit codes or JSON schemas. When browser interaction is unavoidable, the accessibility tree provides a text-based representation of roles and names that is more stable, and far less flaky, than visual matching.
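As a rough illustration of asserting against an accessibility tree, the sketch below assumes the tree has already been captured and serialized as nested dictionaries with `role`, `name`, and `children` keys. That shape is an assumption made for this example, not the output format of any particular browser tool, and the helper names `find_nodes` and `assert_single_node` are hypothetical.

```python
def find_nodes(node, role, name=None):
    """Yield every node in the tree with the given role (and name, if given)."""
    if node.get("role") == role and (name is None or node.get("name") == name):
        yield node
    for child in node.get("children", []):
        yield from find_nodes(child, role, name)


def assert_single_node(tree, role, name):
    """Return the one node matching role and name, or raise AssertionError."""
    matches = list(find_nodes(tree, role, name))
    if len(matches) != 1:
        raise AssertionError(
            f"expected exactly one {role!r} named {name!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical snapshot of a checkout confirmation page.
snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {"role": "button", "name": "View receipt"},
        {
            "role": "group",
            "name": "",
            "children": [{"role": "link", "name": "Home"}],
        },
    ],
}

# The check depends on role and accessible name only, not on pixel
# positions, CSS classes, or the element's path in the DOM.
receipt_button = assert_single_node(snapshot, "button", "View receipt")
```

Because the match uses role and name rather than layout or markup structure, moving the button into a different container or restyling it would not change the result of this check.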
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:27:24.461781+00:00 - report_created - created