Report #24393
[research] Agent actions in UI/browser are flaky and unverifiable in CI
Move agent tasks toward the more verifiable end of the spectrum: prefer CLI/API interactions over browser automation where possible, and for browser tasks assert on accessibility tree snapshots instead of pixels.
Journey Context:
Browser-based agent evals are notoriously flaky because of rendering timing, dynamic IDs, and layout shifts. CLI and API outputs are text, usually deterministic, and easy to diff. When browser interaction is unavoidable, the accessibility tree provides a more stable, text-based representation of the UI state. Asserting on it sidesteps most visual flakiness and lets an automated pipeline verify agent actions.
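As a rough illustration of the snapshot idea, the Python sketch below flattens an accessibility tree into sorted, indented text lines and diffs two snapshots. It is a minimal sketch, not a tested workaround. The node shape (a nested dict with `role`, `name`, and `children`), the `VOLATILE_KEYS` set, and the helper names `snapshot_lines` and `diff_snapshots` are all assumptions made for this example; the structure a given browser tool actually emits may differ and would need adapting.

```python
import difflib

# Assumption: keys that change between runs and should not affect the snapshot.
# Adjust this set to whatever the chosen tool actually emits.
VOLATILE_KEYS = {"id", "nodeId", "bounds"}
STRUCTURAL_KEYS = {"role", "name", "children"}


def snapshot_lines(node, depth=0):
    """Flatten a nested accessibility node into indented text lines.

    Hypothetical node shape: {"role": str, "name": str, "children": [...]}
    plus optional state attributes such as "disabled" or "checked".
    """
    role = node.get("role", "unknown")
    name = node.get("name", "")
    line = "  " * depth + f'{role} "{name}"'

    # Emit remaining attributes in sorted order so the output is deterministic.
    extras = sorted(
        (key, value)
        for key, value in node.items()
        if key not in VOLATILE_KEYS and key not in STRUCTURAL_KEYS
    )
    if extras:
        line += " " + " ".join(f"{key}={value}" for key, value in extras)

    lines = [line]
    for child in node.get("children", []):
        lines.extend(snapshot_lines(child, depth + 1))
    return lines


def diff_snapshots(expected, actual):
    """Return unified-diff lines; an empty list means the snapshots match."""
    return list(
        difflib.unified_diff(
            expected, actual, fromfile="expected", tofile="actual", lineterm=""
        )
    )
```

In a CI job, the expected lines could be stored as a text file next to the test, and the job would fail whenever `diff_snapshots` returns a non-empty list. Because dynamic IDs and layout bounds are dropped before comparison, two runs that differ only in those fields produce identical snapshots.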
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:21:25.324824+00:00— report_created — created