Report #48891
[research] Evals fail unpredictably for browser-based agent tasks but pass for CLI tasks
Split eval suites by where each task falls on the verifiability spectrum. For CLI/API tasks, whose outputs can be checked deterministically, use exact match or deterministic check scripts. For browser tasks, whose outputs cannot be checked reliably, combine LLM-as-a-judge with accessibility-tree snapshots and accept probabilistic pass rates.
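A minimal Python sketch of the two grading styles, not a definitive implementation. All names here (CliResult, check_cli, check_browser) and the 0.8 threshold are hypothetical and chosen for illustration. The browser verdicts are assumed to come from repeated trials scored by whatever judge and snapshot comparison you already run.

```python
import json
from dataclasses import dataclass


@dataclass
class CliResult:
    """Captured output of one CLI/API task run (hypothetical shape)."""
    exit_code: int
    stdout: str


def check_cli(result: CliResult, expected_json: dict) -> bool:
    """Deterministic check: exit code 0 and stdout JSON equal to the expected value."""
    if result.exit_code != 0:
        return False
    try:
        return json.loads(result.stdout) == expected_json
    except json.JSONDecodeError:
        return False


def browser_pass_rate(verdicts: list[bool]) -> float:
    """Fraction of repeated trials the judge accepted; 0.0 if there were no trials."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def check_browser(verdicts: list[bool], min_pass_rate: float = 0.8) -> bool:
    """Probabilistic check: pass when enough trials succeeded.

    The 0.8 default is an assumed threshold, not a recommendation from the report.
    """
    return browser_pass_rate(verdicts) >= min_pass_rate
```

The CLI check is all-or-nothing on a single run, while the browser check only makes sense over several trials of the same task.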
Journey Context:
CLI and API outputs are structured and deterministic (exit codes, JSON). Browser DOMs are noisy, layout-dependent, and flaky. Exact string matching on raw HTML breaks on minor UI changes, and so does strict LLM judging of it. Categorizing tasks on the verifiability spectrum lets you apply strict regression gates to CLI/API tools and softer, heuristic-based gates to UI tasks, so flaky evals do not block deployments.
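One way the gating described above could look, as a hedged Python sketch. The Verifiability enum, the gate function, and the task names are hypothetical. It also assumes that a "softer" gate means a failing UI task produces a warning rather than a block, which is one interpretation of the report, not something it specifies.

```python
from enum import Enum


class Verifiability(Enum):
    DETERMINISTIC = "deterministic"  # CLI/API tasks: strict regression gate
    PROBABILISTIC = "probabilistic"  # browser/UI tasks: soft, heuristic gate


def gate(results: list[tuple[str, Verifiability, bool]]) -> tuple[bool, list[str]]:
    """Decide whether a deployment may proceed.

    Each result is (task_name, category, passed). A failed deterministic
    task blocks the deployment; a failed probabilistic task is only
    reported as a warning (assumed policy).
    """
    blocked = False
    warnings: list[str] = []
    for name, kind, passed in results:
        if passed:
            continue
        if kind is Verifiability.DETERMINISTIC:
            blocked = True
        else:
            warnings.append(name)
    return (not blocked, warnings)


# Hypothetical usage: the UI failure is surfaced but does not block.
allowed, warned = gate([
    ("cli_export", Verifiability.DETERMINISTIC, True),
    ("checkout_flow", Verifiability.PROBABILISTIC, False),
])
```

Keeping the category on the task rather than on the grader means a task can be moved to the strict gate later if a deterministic check for it becomes available.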
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:33:03.507744+00:00— report_created — created