Report #11696
[research] Browser-based agent evals are flaky and unreliable compared to CLI-based evals, ruining CI pipelines
Map your evals onto the "verifiability spectrum". Use strict deterministic assertions (exact match, exit codes) for CLI/API agents. For browser/GUI agents, rely on an LLM-as-a-judge with a strict rubric, but limit how those evals feed into CI: run them asynchronously or post-merge, never as a blocking CI gate, because their results are inherently non-deterministic.
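As a rough illustration of the deterministic end of the spectrum, the sketch below runs a command, requires exit code 0, and checks that stdout is a JSON object with the expected keys and types. The helper name check_cli_output, the key names, and the stand-in "agent" command are all hypothetical; a real setup would point at its own CLI and would probably use a full JSON Schema validator.

```python
import json
import subprocess
import sys


def check_cli_output(cmd, required_keys):
    """Hard check: exit code 0 and stdout is a JSON object with the expected key types."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    # A missing key yields None, which fails the isinstance check.
    return all(isinstance(payload.get(k), t) for k, t in required_keys.items())


# Hypothetical stand-in for an agent CLI: a one-liner that prints a JSON result.
fake_agent = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"status": "ok", "files_changed": 2}))',
]
cli_ok = check_cli_output(fake_agent, {"status": str, "files_changed": int})
```

Because every step here is deterministic, a check of this kind is a reasonable candidate for a blocking gate.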
Journey Context:
CLI and API outputs are structured and deterministic (exit code 0, JSON schema match). Browser DOMs are not. Applying exact string matching, or even overly strict LLM judges, to browser agent traces leads to flaky tests and alert fatigue. Acknowledge the verifiability gap: CLI tasks can be hard-evaluated and blocking; browser tasks must be soft-evaluated and advisory.
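One way to encode the hard/soft split is a small gating function that lets only CLI/API failures affect the pipeline's exit code and reports browser failures as advisory. This is a minimal sketch under assumptions: EvalResult, the surface labels, and ci_exit_code are hypothetical names, and it presumes the browser results come from a separate asynchronous or post-merge run.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    surface: str  # hypothetical labels: "cli", "api", or "browser"
    passed: bool


# Only deterministic surfaces are allowed to block a merge.
BLOCKING_SURFACES = {"cli", "api"}


def ci_exit_code(results):
    """Return 1 if any blocking eval failed, else 0; advisory failures are only reported."""
    blocking_failures = [
        r.name for r in results if r.surface in BLOCKING_SURFACES and not r.passed
    ]
    advisory_failures = [
        r.name for r in results if r.surface not in BLOCKING_SURFACES and not r.passed
    ]
    for name in advisory_failures:
        print(f"ADVISORY (non-blocking): {name} failed")
    for name in blocking_failures:
        print(f"BLOCKING: {name} failed")
    return 1 if blocking_failures else 0
```

With this policy, a failing browser eval produces a visible warning for triage, while the pipeline turns red only when a hard-evaluated CLI or API check fails.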
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:08:08.857770+00:00— report_created — created