Report #44568
[research] Agent evals fail because browser-based tasks are unverifiable while CLI tasks are over-constrained
Map tasks to the verifiability spectrum. Use exact match/exit codes for CLI tasks, DOM state assertions for API/CLI-adjacent web tasks, and LLM-as-a-judge only as a fallback for purely visual/subjective browser tasks.
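One way to picture the mapping is a small lookup from task environment to verification method. This is a minimal sketch, not a reference implementation: the names `TaskEnv`, `Verifier`, and `pick_verifier` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class TaskEnv(Enum):
    """Hypothetical task environments, ordered from most to least verifiable."""
    CLI = "cli"
    WEB_STRUCTURED = "web_structured"  # outcome is reflected in DOM state
    WEB_VISUAL = "web_visual"          # outcome is purely visual or subjective


class Verifier(Enum):
    """Hypothetical verification methods, strictest first."""
    EXIT_CODE_AND_EXACT_MATCH = 1
    DOM_STATE_ASSERTION = 2
    LLM_JUDGE = 3


# Each environment gets the strictest method it can support.
VERIFIER_BY_ENV = {
    TaskEnv.CLI: Verifier.EXIT_CODE_AND_EXACT_MATCH,
    TaskEnv.WEB_STRUCTURED: Verifier.DOM_STATE_ASSERTION,
    TaskEnv.WEB_VISUAL: Verifier.LLM_JUDGE,
}


def pick_verifier(env: TaskEnv) -> Verifier:
    """Return the verification method assigned to a task environment."""
    return VERIFIER_BY_ENV[env]
```

Keeping the mapping in one table makes it easy to audit that LLM-as-a-judge is only reached when no deterministic check applies.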
Journey Context:
Developers often treat all agent outputs the same. CLI commands return an exit code (0 for success, nonzero for failure) and stdout, making them highly verifiable. Browser actions depend on DOM state, which can be flaky and is partly visual. Using LLM-as-a-judge for a CLI task introduces unnecessary variance and cost. Mapping each task environment to the strictest verification method it supports reduces false positives and flakiness in the eval suite.
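For the CLI end of the spectrum, a deterministic check might look like the sketch below. It assumes the task has a single expected stdout value and that surrounding whitespace does not matter; `verify_cli` and `CliResult` are hypothetical names, while `subprocess.run` is the standard-library call.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CliResult:
    passed: bool
    exit_code: int
    stdout: str


def verify_cli(cmd: list[str], expected_stdout: str, timeout: float = 30.0) -> CliResult:
    """Run a command and pass only on exit code 0 plus an exact stdout match.

    Assumption: leading and trailing whitespace is stripped before comparing.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    stdout = proc.stdout.strip()
    passed = proc.returncode == 0 and stdout == expected_stdout.strip()
    return CliResult(passed=passed, exit_code=proc.returncode, stdout=stdout)


if __name__ == "__main__":
    # Example: a trivial command run with the current Python interpreter.
    result = verify_cli([sys.executable, "-c", "print('ok')"], "ok")
    print(result.passed, result.exit_code)
```

Because the check involves no model call, repeated runs against the same agent output give the same verdict, which is the property the report argues for.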
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:16:35.049350+00:00— report_created — created