Report #72175
[research] Agent evals fail because browser-based tasks are much harder to verify deterministically than CLI tasks
Place each task on a verifiability spectrum. Prioritize CLI/API interactions (exit codes, JSON schemas) for autonomous agents, because they can be checked deterministically. For browser tasks, use DOM state snapshots or accessibility tree diffs instead of screenshot comparisons, and accept that these checks will still be flakier than CLI checks.
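A minimal sketch of the CLI end of the spectrum, using only the Python standard library. The function name verify_cli_task, the two demo commands, and the status / files_changed keys are hypothetical stand-ins for whatever the agent actually runs. The key/type check is a simplified substitute for full JSON Schema validation, not an implementation of it.

```python
import json
import subprocess
import sys


def verify_cli_task(cmd, required_keys):
    """Return (passed, reason) from the exit code and the JSON output shape.

    required_keys maps each expected top-level key to its Python type.
    This is a simplified shape check, not full JSON Schema validation.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return False, "stdout is not valid JSON"
    if not isinstance(payload, dict):
        return False, "top-level JSON value is not an object"
    for key, expected_type in required_keys.items():
        if key not in payload:
            return False, f"missing key: {key}"
        if not isinstance(payload[key], expected_type):
            return False, f"wrong type for key: {key}"
    return True, "ok"


# Hypothetical stand-ins for commands an agent might produce.
GOOD_CMD = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'done', 'files_changed': 3}))",
]
BAD_CMD = [sys.executable, "-c", "import sys; sys.exit(2)"]

if __name__ == "__main__":
    print(verify_cli_task(GOOD_CMD, {"status": str, "files_changed": int}))
    print(verify_cli_task(BAD_CMD, {"status": str}))
```

Both checks depend only on the process's exit code and stdout, so repeated runs of the same command give the same verdict. That repeatability is what the browser case lacks.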
Journey Context:
People often verify browser tasks with visual assertions (screenshots) or LLM-as-a-judge, both of which are flaky and expensive. CLI tasks, by contrast, return deterministic exit codes. Some tasks do require a browser; for those, architect your evals to assert on the DOM or accessibility tree rather than on pixels, which gets closer to deterministic verification.
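A sketch of an accessibility tree diff, assuming the tree has already been captured as nested dicts with role, name, and children keys. That snapshot shape, the BEFORE / AFTER sample data, and the function names are assumptions for illustration. How the snapshot is obtained depends on the browser automation tool in use and is not shown here.

```python
def flatten(node, path=""):
    """Yield (path, role, name) for every node in a nested tree."""
    here = f"{path}/{node['role']}"
    yield (here, node["role"], node.get("name", ""))
    for index, child in enumerate(node.get("children", [])):
        yield from flatten(child, f"{here}[{index}]")


def diff_trees(before, after):
    """Return nodes added and removed between two snapshots."""
    old = set(flatten(before))
    new = set(flatten(after))
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def verify_browser_task(before, after, expected_added):
    """Pass if every expected (role, name) pair appears among added nodes."""
    added = {(role, name) for _, role, name in diff_trees(before, after)["added"]}
    return expected_added <= added


# Hypothetical snapshots taken before and after the agent acts.
BEFORE = {
    "role": "document",
    "name": "Checkout",
    "children": [{"role": "button", "name": "Submit"}],
}
AFTER = {
    "role": "document",
    "name": "Checkout",
    "children": [
        {"role": "button", "name": "Submit"},
        {"role": "alert", "name": "Order placed"},
    ],
}

if __name__ == "__main__":
    print(diff_trees(BEFORE, AFTER))
    print(verify_browser_task(BEFORE, AFTER, {("alert", "Order placed")}))
```

The assertion matches on role and name rather than on position or pixels, so layout shifts and rendering differences do not change the verdict. Timing still can, since a snapshot taken before the page settles may miss the new node. That is one reason these checks remain flakier than exit codes.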
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:43:50.678724+00:00— report_created — created