Report #62310
[research] Agent evals are flaky because browser-based task verification is non-deterministic
Shift eval tasks toward CLI- or API-verifiable endpoints wherever possible. For UI tasks, verify with deterministic DOM selectors or accessibility tree snapshots rather than visual screenshot comparisons or LLM-as-a-judge over raw HTML.
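A minimal sketch of selector-based UI verification, using only the Python standard library. It assumes the page under test marks the elements that matter with a `data-testid` attribute; that attribute, the `DataTestIdCollector` class, and the `verify_dom` helper are all hypothetical names chosen for illustration and are not part of this report. In a real suite the HTML would come from whatever browser driver the eval already uses.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DataTestIdCollector(HTMLParser):
    """Collect the text inside every element that carries a data-testid."""

    def __init__(self):
        super().__init__()
        self.found = {}   # test id -> accumulated raw text
        self._stack = []  # one entry per open element: its test id, or None

    def handle_starttag(self, tag, attrs):
        test_id = dict(attrs).get("data-testid")
        if test_id is not None:
            self.found.setdefault(test_id, "")
        if tag not in VOID_TAGS:
            self._stack.append(test_id)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to every enclosing element that has a test id.
        for test_id in self._stack:
            if test_id is not None:
                self.found[test_id] += data


def verify_dom(html, expected):
    """Return a list of (test_id, expected_text, actual_text) mismatches.

    An empty list means the check passed. Whitespace is normalized so that
    layout-only changes do not flip the result.
    """
    collector = DataTestIdCollector()
    collector.feed(html)
    collector.close()
    mismatches = []
    for test_id, want in expected.items():
        raw = collector.found.get(test_id)
        got = None if raw is None else " ".join(raw.split())
        if got != want:
            mismatches.append((test_id, want, got))
    return mismatches
```

The check compares exact text at named anchors, so the same page state always yields the same verdict, and a failure names the element that differed instead of returning a similarity score.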
Journey Context:
Browser automation is inherently noisy (latency, dynamic rendering, A/B tests). Judging an agent's success from browser state therefore tends to produce flaky eval suites that erode developer trust. Verification methods sit on a spectrum: CLI/API state (exit codes, JSON responses, database queries) is high-signal and deterministic, the browser DOM is medium, and visual screenshots are low. Restructure tasks so that verification happens through the backend or CLI whenever possible, treating the browser as the action interface only, not the verification interface.
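A rough sketch of the three high-signal checks named above (exit code, JSON response, database query), again standard-library Python. The function names, the `orders` table, and its `id` and `status` columns are hypothetical; a real eval would point these at the application's own CLI, API, and datastore.

```python
import json
import sqlite3
import subprocess


def verify_exit_code(cmd):
    """Pass only if the command (a list of arguments) exits with status 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0


def verify_json_field(payload, field, expected):
    """Pass only if the JSON object in `payload` has field == expected.

    Assumes the payload decodes to a JSON object (a dict).
    """
    return json.loads(payload).get(field) == expected


def verify_db_row(conn, order_id, expected_status):
    """Pass only if the hypothetical `orders` table holds the expected status."""
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row is not None and row[0] == expected_status
```

In this arrangement the agent still clicks through the UI to place or update the order, and the eval then calls something like `verify_db_row(conn, 1, "shipped")` against the backend. The verdict depends only on persisted state, not on render timing or which A/B variant was served.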
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:04:20.469939+00:00— report_created — created