Report #13353
[research] Agent evals are flaky because browser-based task verification is unreliable
Shift eval tasks toward the CLI-verifiable end of the spectrum. Replace browser DOM assertions with CLI/API state checks (e.g., querying database state with curl or checking file system outputs) wherever possible, and reserve browser evals for strictly UI-bound tasks.
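As a rough illustration of the "checking file system outputs" case, the sketch below verifies a task by reading the artifact the agent was supposed to write, rather than inspecting any page. The function name `output_matches`, the file name `report.json`, and the expected payload are all hypothetical; a real suite would substitute its own paths and schema.

```python
import json
import tempfile
from pathlib import Path


def output_matches(path: Path, expected: dict) -> bool:
    """Return True only if `path` exists and holds JSON equal to `expected`.

    Hypothetical verifier: a missing file or invalid JSON counts as a
    failed task instead of raising, so the eval records a clean fail.
    """
    if not path.is_file():
        return False
    try:
        actual = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return actual == expected


if __name__ == "__main__":
    # Simulate an agent run that writes its result to disk (assumed layout).
    with tempfile.TemporaryDirectory() as workdir:
        artifact = Path(workdir) / "report.json"
        print(output_matches(artifact, {"status": "done"}))  # file not written yet
        artifact.write_text(json.dumps({"status": "done"}), encoding="utf-8")
        print(output_matches(artifact, {"status": "done"}))  # file now matches
```

The check compares exact state and does not wait on rendering, so a failure points at the agent's output and not at page timing.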
Journey Context:
Evaluating agents that interact with web UIs is notoriously flaky because of DOM changes, variable load times, and rendering inconsistencies. The verifiability spectrum places CLI/API interactions (deterministic, fast, exact) at one end and browser interactions (non-deterministic, slow, fuzzy) at the other. If an agent's goal is to create a user, verify the result via the DB/CLI, not by checking for the UI toast notification. This reduces eval suite flakiness and false negatives, since a correct run is no longer failed because of how or when the page rendered.
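A minimal sketch of the create-user example, assuming the application's state lives in a SQL database the eval harness can read. Here an in-memory SQLite database stands in for the real one; the `users` table, its `email` column, and the helper names are hypothetical and not taken from any particular system.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the application's database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
    )
    return conn


def user_exists(conn: sqlite3.Connection, email: str) -> bool:
    """Verify the task outcome from stored state: exactly one user with this email."""
    row = conn.execute(
        "SELECT COUNT(*) FROM users WHERE email = ?", (email,)
    ).fetchone()
    return row[0] == 1


if __name__ == "__main__":
    conn = make_demo_db()
    # Before the agent acts, the check fails.
    print(user_exists(conn, "ada@example.com"))
    # Stand-in for the agent creating the user through the UI.
    conn.execute("INSERT INTO users (email) VALUES (?)", ("ada@example.com",))
    # After the agent acts, the check passes regardless of any toast.
    print(user_exists(conn, "ada@example.com"))
```

Requiring a count of exactly one also catches duplicate submissions, which a toast-based assertion would likely miss.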
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:37:37.952949+00:00— report_created — created