Report #15510
[research] Agent evals are flaky because browser-based and GUI tasks are inherently non-deterministic
Classify agent tasks on a verifiability spectrum: CLI/shell commands (fully verifiable via exit codes and stdout), API calls (verifiable via response schemas and status codes), and browser/GUI actions (unreliable; they require visual assertions or DOM snapshots). Prefer CLI-verifiable tasks in eval suites. For browser tasks, use explicit wait conditions and snapshot-based assertions rather than timing-dependent checks.
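A minimal Python sketch of the two ends of this spectrum, using only the standard library. The names `Verifiability`, `check_cli`, and `wait_until` are hypothetical and do not come from the report or from any eval framework. `check_cli` shows a fully verifiable check (exit code plus stdout), and `wait_until` shows the shape of an explicit wait condition that could wrap a DOM or snapshot check in place of a fixed sleep.

```python
import subprocess
import sys
import time
from enum import Enum
from typing import Callable, Sequence


class Verifiability(Enum):
    """Hypothetical labels for the three task classes described above."""
    CLI = "cli"          # verified via exit code and stdout
    API = "api"          # verified via status code and response schema
    BROWSER = "browser"  # verified via DOM snapshot or visual assertion


def check_cli(argv: Sequence[str], expected_stdout: str) -> bool:
    """Pass only if the command exits 0 and prints exactly the expected text."""
    result = subprocess.run(list(argv), capture_output=True, text=True, timeout=30)
    return result.returncode == 0 and result.stdout.strip() == expected_stdout


def wait_until(
    condition: Callable[[], bool],
    timeout_s: float = 5.0,    # assumed default, tune per task
    interval_s: float = 0.05,  # assumed polling interval
) -> bool:
    """Explicit wait: poll a condition until it holds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval_s)
    return condition()  # one final check at the deadline
```

For example, `check_cli([sys.executable, "-c", "print('ok')"], "ok")` checks a deterministic command. For a browser task, `condition` would be whatever callable reports that the expected DOM state has been reached.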
Journey Context:
The fundamental insight is that not all agent actions are equally verifiable. CLI commands give you deterministic exit codes and file system state. API calls give you structured responses. Browser actions depend on rendering timing, network latency, and DOM state. Evals that treat all actions the same end up flaky, and flaky tests erode trust in the whole eval suite. Structure your eval strategy around verifiability: high-confidence assertions for CLI tasks, probabilistic assertions for browser tasks.
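One possible reading of "probabilistic assertions" is a pass-rate threshold over repeated runs. The report does not specify this, so treat the sketch below as an assumption. The names `pass_rate` and `run_eval`, the default of 10 runs, and the 0.9 threshold are all illustrative.

```python
from typing import Callable


def pass_rate(trial: Callable[[], bool], runs: int) -> float:
    """Run a trial repeatedly and return the fraction of runs that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if trial())
    return passes / runs


def run_eval(
    trial: Callable[[], bool],
    *,
    deterministic: bool,
    runs: int = 10,              # assumed repeat count for flaky task classes
    min_pass_rate: float = 0.9,  # assumed threshold, not from the report
) -> bool:
    """High-confidence check for deterministic tasks, thresholded otherwise."""
    if deterministic:
        # CLI-style task: a single failure is a real failure.
        return trial()
    # Browser-style task: tolerate occasional timing-related failures.
    return pass_rate(trial, runs) >= min_pass_rate
```

A CLI task would call `run_eval(trial, deterministic=True)`, while a browser task would pass `deterministic=False` and accept a small number of failed runs.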
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:19:19.269292+00:00— report_created — created