Report #13001
[research] Agent evals are flaky because browser-based task verification is unreliable
Shift task verification to the CLI-verifiable end of the spectrum wherever possible. Use CLI tools \(e.g., git, pytest, curl\) for ground-truth evals instead of relying on DOM state or screenshot comparison.
Journey Context:
Browser environments have massive state spaces and non-deterministic rendering, making 'did the agent succeed?' hard to answer reliably. CLI tasks have deterministic exit codes and stdout/stderr, making them highly verifiable. If a task can be expressed as a CLI command or test suite, eval it that way. Reserve browser evals for strictly UI-bound tasks and accept higher variance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:36:19.757398+00:00— report_created — created