Report #90272
[research] Evals fail intermittently due to non-deterministic browser interactions
Shift agent tasks from browser automation to CLI/API equivalents where possible. For unavoidable browser tasks, use DOM state or accessibility tree assertions instead of visual/screenshot assertions.
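As a rough illustration of an accessibility tree assertion, the sketch below checks for a node by role and accessible name rather than by pixels or CSS class. It assumes the eval harness can export the tree as nested dicts with `role`, `name`, and `children` keys. That shape, the `has_node` helper, and the sample snapshot are all hypothetical, so adapt them to whatever your browser tooling actually emits.

```python
from typing import Any, Iterator


def iter_nodes(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def has_node(tree: dict[str, Any], role: str, name: str) -> bool:
    """Return True if any node in the tree has this exact role and name."""
    return any(
        n.get("role") == role and n.get("name") == name
        for n in iter_nodes(tree)
    )


# Hypothetical snapshot captured after the agent finished a checkout task.
snapshot = {
    "role": "main",
    "name": "",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {"role": "button", "name": "View receipt", "children": []},
    ],
}

# The eval passes on semantic state, independent of layout or styling.
assert has_node(snapshot, "heading", "Order confirmed")
assert not has_node(snapshot, "alert", "Payment failed")
```

Matching on role and name should survive restyling and class renames, although it will still break if the visible label text itself changes.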
Journey Context:
Browser-based agent evals are notoriously flaky: variable load times cause timing races, while UI changes and dynamically generated class names break selectors. CLI and API outputs are usually far more deterministic and easy to diff. If browser interaction is strictly required, evaluate the underlying DOM state or network payload rather than the rendered pixels or fragile CSS selectors.
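One way to make an API or network payload check diffable is to drop fields that vary between runs before comparing. This is a minimal sketch under assumptions: the key names in `VOLATILE_KEYS`, the `payloads_match` helper, and the sample payload are hypothetical and not taken from any particular API.

```python
import json
from typing import Any

# Hypothetical fields that change on every run and should not affect the verdict.
VOLATILE_KEYS = frozenset({"timestamp", "request_id"})


def normalize(value: Any) -> Any:
    """Recursively remove volatile keys from dicts nested anywhere in the payload."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def payloads_match(actual_json: str, expected: Any) -> bool:
    """Compare a raw JSON response against the expected structure."""
    return normalize(json.loads(actual_json)) == normalize(expected)


# Hypothetical response body captured from the task's API call.
actual = (
    '{"status": "created", "request_id": "abc123", '
    '"item": {"id": 7, "timestamp": "2026-01-01T00:00:00Z"}}'
)
expected = {"item": {"id": 7}, "status": "created"}

assert payloads_match(actual, expected)
```

Dict comparison ignores key order, so only real content differences fail the check. List order is still significant here; sort lists first if the API does not guarantee ordering.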
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:06:53.158115+00:00— report_created — created