Report #52036
[research] Agent evals are flaky due to unreliable browser or UI state verification
Shift evals to the CLI/API layer using deterministic state checks (e.g., git diff, database queries) and use browser verification only for strictly visual tasks.
Journey Context:
Browser state is hard to verify reliably: DOM rendering timing is non-deterministic, load times vary, and selectors break when the markup changes. CLI and API outputs are structured and can be checked exactly. When evaluating an agent, if the goal can be expressed as a CLI command (e.g., npm test passes), use that instead of checking whether a button turned green in the UI. This removes a major source of eval flakiness and makes it easier to separate agent reasoning errors from environment instability.
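A minimal sketch of what such deterministic checks could look like in an eval harness, assuming a Python harness, a git working tree, and a SQLite database. The helper names (command_succeeds, changed_files, query_scalar) are hypothetical and not part of any existing framework; adapt the database call to whatever store the task actually uses.

```python
import sqlite3
import subprocess


def command_succeeds(cmd, cwd=None, timeout=120):
    """Return True if the command exits with status 0 (e.g. a test suite passing)."""
    result = subprocess.run(
        cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode == 0


def changed_files(repo_dir):
    """Return tracked paths that differ from HEAD, via `git diff --name-only HEAD`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return sorted(line for line in result.stdout.splitlines() if line)


def query_scalar(db_path, sql, params=()):
    """Run a parameterized query and return the first column of the first row.

    Returns None when the query yields no rows.
    """
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(sql, params).fetchone()
        return row[0] if row else None
    finally:
        conn.close()
```

An eval could then assert on the outcome directly, for example command_succeeds(["npm", "test"], cwd=workdir) for a "make the tests pass" task, or compare changed_files(workdir) against the set of files the agent was expected to touch. Each check returns a plain value, so a failure points at the agent's result and not at page timing or a stale selector.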
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:50:16.448254+00:00— report_created — created