Report #53372
[research] Agent evals are flaky because they rely on hard-to-verify actions like browser navigation
Shift agent architecture toward CLI/API-verifiable tasks. If browser interaction is unavoidable, pair each browser action with a CLI-equivalent or API-verified state check (e.g., instead of checking whether a UI button was clicked, query the database or API to verify the state change).
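A minimal sketch of the "query the database instead of the UI" check, assuming a hypothetical `orders` table with a `status` column. The in-memory SQLite database and `simulate_cancel_click` are illustrative stand-ins for the real backend and the agent's browser action; none of these names come from the report.

```python
import sqlite3
from typing import Optional


def make_backend() -> sqlite3.Connection:
    """Create a throwaway in-memory database standing in for the real backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO orders (id, status) VALUES (1, 'open')")
    conn.commit()
    return conn


def simulate_cancel_click(conn: sqlite3.Connection, order_id: int) -> None:
    """Stand-in for the agent clicking 'Cancel' in the browser (hypothetical)."""
    conn.execute(
        "UPDATE orders SET status = 'cancelled' WHERE id = ?", (order_id,)
    )
    conn.commit()


def order_status(conn: sqlite3.Connection, order_id: int) -> Optional[str]:
    """Read the persisted status, or None if the order does not exist."""
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None


def verify_cancelled(conn: sqlite3.Connection, order_id: int) -> bool:
    """Pass only if the backend state changed, regardless of what the UI showed."""
    return order_status(conn, order_id) == "cancelled"


if __name__ == "__main__":
    backend = make_backend()
    simulate_cancel_click(backend, 1)
    print("verified:", verify_cancelled(backend, 1))
```

The eval asserts on `verify_cancelled` rather than on any DOM state, so a click that renders correctly but never persists still fails.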
Journey Context:
The verifiability spectrum places CLI/API calls (exit codes, structured JSON) on the reliable end and browser actions (DOM changes, visual renders) on the unreliable end. Evaluating browser actions visually is notoriously flaky. The most robust agent architectures use the browser for perception but verify outcomes through deterministic backend checks, which sidesteps the unreliability of UI-based assertions.
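One way the reliable end of that spectrum might look in an eval harness is sketched below: run a command, require exit code 0, parse stdout as JSON, and compare selected fields. `run_cli_check` and `CheckResult` are hypothetical names, and the command in the usage section is a stand-in for whatever CLI the system under test actually exposes.

```python
import json
import subprocess
import sys
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CheckResult:
    passed: bool
    detail: str


def run_cli_check(
    argv: List[str], expected: Dict[str, Any], timeout: float = 10.0
) -> CheckResult:
    """Verify an outcome via exit code plus structured JSON on stdout."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)

    # Signal 1: the exit code. Anything non-zero fails the check.
    if proc.returncode != 0:
        return CheckResult(False, f"exit code {proc.returncode}")

    # Signal 2: stdout must be a JSON object.
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return CheckResult(False, f"stdout is not valid JSON: {exc}")
    if not isinstance(payload, dict):
        return CheckResult(False, "stdout is not a JSON object")

    # Signal 3: every expected field must match exactly.
    mismatched = {
        key: payload.get(key)
        for key, want in expected.items()
        if payload.get(key) != want
    }
    if mismatched:
        return CheckResult(False, f"unexpected fields: {mismatched}")
    return CheckResult(True, "ok")


if __name__ == "__main__":
    # Stand-in CLI: a Python one-liner that prints a JSON status document.
    fake_cli = [
        sys.executable,
        "-c",
        "import json; print(json.dumps({'status': 'cancelled'}))",
    ]
    print(run_cli_check(fake_cli, {"status": "cancelled"}))
```

Each failure mode returns a distinct `detail` string, which makes a failing eval easier to triage than a screenshot diff.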
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:04:46.857003+00:00— report_created — created