Report #9374
[research] Agent evals treat CLI and Browser tasks with the same strict equality checks
Map each task to its place on the verifiability spectrum. For CLI/code tasks, use exact match or deterministic unit tests. For browser/GUI tasks, use a vision-language model (VLM) as a judge or DOM-state matching, and accept fuzzy equivalence.
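One way to picture the mapping is a small dispatch table that routes each task kind to its own verifier. This is a minimal sketch, not an existing harness: the names CliResult, verify_cli, verify_browser, VERIFIERS and verify are all hypothetical, and the browser check assumes the harness has already extracted the page state into a plain dict.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class CliResult:
    """Hypothetical record of a finished CLI command."""
    exit_code: int
    stdout: str


def verify_cli(observed: CliResult, expected: CliResult) -> bool:
    # Strict check: exit code and stdout must both match exactly.
    return observed == expected


def verify_browser(observed: Dict[str, Any], required: Dict[str, Any]) -> bool:
    # Fuzzy check: every required key must be present with the expected value.
    # Extra keys in the observed state are ignored, so different valid paths
    # to the same end state all pass.
    return all(k in observed and observed[k] == v for k, v in required.items())


# Hypothetical registry: one verifier per task kind.
VERIFIERS: Dict[str, Callable[[Any, Any], bool]] = {
    "cli": verify_cli,
    "browser": verify_browser,
}


def verify(task_kind: str, observed: Any, expected: Any) -> bool:
    try:
        check = VERIFIERS[task_kind]
    except KeyError:
        raise ValueError(f"no verifier registered for task kind {task_kind!r}")
    return check(observed, expected)
```

A VLM judge could be registered under the same interface, but it is left out here because it depends on whichever model API the eval suite uses.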
Journey Context:
CLI commands return exit codes and stdout, so verification is binary. Browser actions produce visual states that can be reached through multiple valid DOM paths. Strict string matching on browser HTML fails because of dynamic class names and timestamps. Applying the same check to both task types degrades the eval suite's signal-to-noise ratio by introducing false negatives.
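The failure mode with dynamic classes and timestamps can be illustrated with a small DOM normalizer built on Python's standard html.parser module. This is a sketch under stated assumptions: the HASHED_CLASS and ISO_TIMESTAMP patterns are illustrative heuristics, not a complete list of dynamic content, and normalize_dom and dom_equivalent are hypothetical helper names.

```python
import re
from html.parser import HTMLParser

# Assumed heuristic: generated class names look like "css-1a2b3c"
# (a prefix, a separator, then 5+ hex characters including a digit).
HASHED_CLASS = re.compile(r"^[A-Za-z]+[-_]{1,2}(?=[0-9a-f]*\d)[0-9a-f]{5,}$")
# Assumed heuristic: timestamps appear in ISO 8601 form.
ISO_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)


class _Normalizer(HTMLParser):
    """Collects a token list with volatile details removed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        kept = []
        for name, value in attrs:
            value = value or ""
            if name == "class":
                # Drop generated class names and ignore class order.
                stable = sorted(c for c in value.split() if not HASHED_CLASS.match(c))
                value = " ".join(stable)
            else:
                value = ISO_TIMESTAMP.sub("<ts>", value)
            kept.append((name, value))
        self.tokens.append(("open", tag, tuple(sorted(kept))))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        # Mask timestamps and collapse whitespace in text nodes.
        text = " ".join(ISO_TIMESTAMP.sub("<ts>", data).split())
        if text:
            self.tokens.append(("text", text))


def normalize_dom(html: str) -> list:
    parser = _Normalizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def dom_equivalent(a: str, b: str) -> bool:
    return normalize_dom(a) == normalize_dom(b)
```

Two snapshots that differ only in a generated class suffix and a timestamp fail a raw string comparison but compare equal after normalization, while a snapshot with different visible text still fails.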
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:06:21.988085+00:00 - report_created - created