Report #1491
[research] Agent evals are flaky because browser-based tasks are unreliable to verify
Classify agent tasks on the verifiability spectrum and design evals accordingly. For CLI/filesystem tasks, use exact match or deterministic test suites (e.g., pytest). For browser/GUI tasks, shift from strict assertion to visual diffing (e.g., Playwright screenshots) or LLM-as-a-judge with accessibility tree snapshots, accepting probabilistic pass rates.
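As a rough illustration of the split described above, the sketch below routes a task to one of two checkers depending on its verifiability class. Every name here (`Verifiability`, `EvalResult`, `eval_deterministic`, `eval_probabilistic`, the `judge` callable, and the 0.8 threshold) is hypothetical and chosen for the example; the judge is a stand-in for whatever visual diff or LLM-as-a-judge call you actually use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Verifiability(Enum):
    DETERMINISTIC = "deterministic"  # CLI/filesystem: exit codes, exact stdout
    PROBABILISTIC = "probabilistic"  # browser/GUI: screenshots, accessibility trees


@dataclass
class EvalResult:
    passed: bool
    pass_rate: float
    mode: Verifiability


def eval_deterministic(exit_code: int, stdout: str, expected_stdout: str) -> EvalResult:
    """Strict check: the command must succeed and stdout must match exactly
    (ignoring leading/trailing whitespace)."""
    ok = exit_code == 0 and stdout.strip() == expected_stdout.strip()
    return EvalResult(ok, 1.0 if ok else 0.0, Verifiability.DETERMINISTIC)


def eval_probabilistic(
    snapshots: Sequence[str],
    judge: Callable[[str], bool],
    min_pass_rate: float = 0.8,  # assumed threshold, tune per task
) -> EvalResult:
    """Soft check: run the judge over several snapshots (one per trial)
    and pass if the fraction of positive verdicts meets the threshold."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    votes = [judge(s) for s in snapshots]
    rate = sum(votes) / len(votes)
    return EvalResult(rate >= min_pass_rate, rate, Verifiability.PROBABILISTIC)
```

A usage sketch with a trivial stub judge: `eval_probabilistic(snapshots, lambda tree: "Submit" in tree)` would pass a task whose accessibility tree snapshot contains the expected control in at least 80% of trials, while `eval_deterministic(0, "ok\n", "ok")` passes only on an exact match with a zero exit code.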
Journey Context:
A common mistake is treating all agent outputs as equally verifiable. CLI commands return exit codes and structured stdout, which makes evals deterministic and fast. Browser interactions return noisy pixels or DOM snapshots, which makes strict assertions flaky. If you apply CLI-style exact-match evals to browser tasks, your eval suite will fail frequently on minor UI shifts and cause alert fatigue. Segmenting your eval strategy by verifiability keeps your deterministic tests high-signal for regressions, while your probabilistic tests are monitored for trend shifts rather than treated as hard failures.
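One way to monitor probabilistic tests for trend shifts instead of hard failures is to compare a recent window of pass rates against the window before it. This is a minimal sketch under assumptions: the function name `trend_shift`, the window size of 5 runs, and the 0.10 tolerance are all illustrative, not values taken from this report.

```python
from statistics import mean
from typing import Sequence


def trend_shift(
    pass_rates: Sequence[float],
    window: int = 5,        # assumed number of runs per comparison window
    max_drop: float = 0.10,  # assumed tolerated drop in mean pass rate
) -> bool:
    """Return True when the mean pass rate of the latest `window` runs
    has dropped by more than `max_drop` relative to the preceding window.
    A single flaky run is diluted by the window and does not fire."""
    if len(pass_rates) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = mean(pass_rates[-2 * window:-window])
    recent = mean(pass_rates[-window:])
    return (baseline - recent) > max_drop
```

With this shape, one bad browser run inside an otherwise healthy window stays below the tolerance, while a sustained drop across the whole window raises a flag. Deterministic CLI tests would not go through this path; they can keep failing the build on the first mismatch.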
Lifecycle
2026-06-15T00:30:40.537421+00:00— report_created — created