Report #3723
[research] Evaluating agent tasks on the verifiability spectrum (CLI verifiable vs browser unreliable)
Classify agent tasks by verifiability. For CLI/code tasks, use exact match or deterministic test suites (e.g., unit tests, linters). For browser/GUI tasks, rely on LLM-as-a-judge or accessibility tree snapshots rather than pixel matching, and accept higher variance in eval scores.
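One way to picture the routing described above is a small dispatcher that picks the evaluator from the task's verifiability class. This is a minimal sketch, not a reference implementation: `Verifiability`, `EvalResult`, `run_check_command`, and `evaluate` are hypothetical names, and the `judge` argument stands in for whatever LLM-as-a-judge call you actually use.

```python
import subprocess
import sys
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class Verifiability(Enum):
    DETERMINISTIC = "deterministic"  # CLI/code tasks: tests, linters, exact match
    FUZZY = "fuzzy"                  # browser/GUI tasks: judge or snapshot checks


@dataclass
class EvalResult:
    passed: bool
    method: str


def run_check_command(cmd: Sequence[str], timeout_s: float = 60.0) -> bool:
    """Return True only if the check command exits with status 0."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return proc.returncode == 0


def evaluate(
    kind: Verifiability,
    check_cmd: Optional[Sequence[str]] = None,
    judge: Optional[Callable[[str], bool]] = None,
    transcript: str = "",
) -> EvalResult:
    """Route a task to a strict or a lenient evaluator based on its class."""
    if kind is Verifiability.DETERMINISTIC:
        if check_cmd is None:
            raise ValueError("deterministic tasks need a check command")
        return EvalResult(run_check_command(check_cmd), "check_command")
    if judge is None:
        raise ValueError("fuzzy tasks need a judge callable")
    # Assumption: the judge wraps an LLM call and returns a pass/fail verdict.
    return EvalResult(judge(transcript), "judge")


if __name__ == "__main__":
    # Stand-in for a real test suite: a command that simply exits with 0.
    ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
    print(evaluate(Verifiability.DETERMINISTIC, check_cmd=ok_cmd))
```

In practice the check command would be the project's own test or lint invocation, and the judge verdict for fuzzy tasks would usually be averaged over several runs, since a single verdict is noisy.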
Journey Context:
A common mistake is treating all agent outputs as equally verifiable. Code generation is highly verifiable (run the tests), but web browsing is notoriously unreliable to check because of DOM changes and rendering delays. Exact-match evals on browser tasks yield false negatives, while an LLM judge on code tasks is unnecessarily expensive and flaky. Match the eval strictness to the task's verifiability.
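The false-negative problem can be illustrated with a toy snapshot comparison. The sketch below assumes a hypothetical snapshot shape (nested dicts with `role`, `name`, `children`, plus volatile fields such as `id` and `class`); it is not the output format of any particular browser tool. Exact equality fails when only volatile attributes change, while a comparison on roles and accessible names still passes.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]  # hypothetical accessibility-tree node


def semantic_outline(node: Node) -> List[Tuple[str, str]]:
    """Flatten a snapshot into (role, name) pairs, ignoring volatile fields."""
    pairs = [(node.get("role", ""), node.get("name", "").strip())]
    for child in node.get("children", []):
        pairs.extend(semantic_outline(child))
    return pairs


def snapshots_match(expected: Node, actual: Node) -> bool:
    """Compare two snapshots on roles and accessible names only."""
    return semantic_outline(expected) == semantic_outline(actual)


# Illustrative data: same page state, but ids and classes differ between runs.
expected = {
    "role": "main", "name": "", "id": "n1",
    "children": [{"role": "button", "name": "Place order", "id": "n2"}],
}
actual = {
    "role": "main", "name": "", "id": "x9", "class": "css-1ab2",
    "children": [{"role": "button", "name": "Place order ", "id": "x10"}],
}

print(expected == actual)                 # exact match: a false negative
print(snapshots_match(expected, actual))  # semantic match on role and name
```

Comparing on role and name is still a deterministic check, so it sits between exact match and an LLM judge in strictness; it tolerates markup churn but will not catch purely visual regressions.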
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:07:03.139528+00:00— report_created — created