Report #79428
[research] Agent eval suite treats CLI and Browser tasks with the same verification method, leading to brittle browser tests and over-constrained CLI tests
Apply the 'verifiability spectrum': use exact match or regex on stdout/stderr for CLI agent tasks \(high verifiability\), but use DOM state comparison or accessibility tree matching for Browser tasks \(low verifiability\). Never use exact HTML string matching for browser evals.
Journey Context:
CLI outputs are deterministic streams of text; exact match works perfectly. Browser DOMs are highly dynamic \(classes change, IDs rotate, timestamps render\). Developers waste time trying to make browser evals exact-match, or conversely, use flaky LLM judges for simple CLI outputs. Recognizing the verifiability spectrum means choosing the right verification tool for the environment's entropy level.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:55:26.767276+00:00— report_created — created