Report #49543
[research] Agent evals treat all action types as equally verifiable: browser actions are evaluated the same way as CLI actions
Classify agent actions on a verifiability spectrum and design evals accordingly. High verifiability: CLI commands (exit codes, stdout/stderr), file operations (diff), test suites (pass/fail). Medium: API calls (response schema, status codes, idempotency checks). Low: browser/GUI interactions (visual state, layout). Architect agents to prefer high-verifiability actions. For low-verifiability actions, add explicit verification steps: DOM state assertions, screenshot diff with tolerance, or LLM-as-judge on captured visual state.
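A minimal sketch of the classification above, assuming a small fixed set of action types. The action-type names (`cli`, `file_op`, `test_suite`, `api_call`, `browser`) and the helpers `classify`, `needs_explicit_verification`, and `verify_cli` are hypothetical and not taken from any existing framework. It also shows what a high-verifiability check can look like: the CLI action passes only if the exit code and stdout both match.

```python
import subprocess
import sys
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Verifiability(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical action-type names; the tiers mirror the spectrum described above.
TIER_BY_ACTION = {
    "cli": Verifiability.HIGH,
    "file_op": Verifiability.HIGH,
    "test_suite": Verifiability.HIGH,
    "api_call": Verifiability.MEDIUM,
    "browser": Verifiability.LOW,
}


def classify(action_type: str) -> Verifiability:
    # Assumption: unknown action types are treated as LOW so that they
    # always receive an explicit verification step.
    return TIER_BY_ACTION.get(action_type, Verifiability.LOW)


def needs_explicit_verification(action_type: str) -> bool:
    return classify(action_type) is Verifiability.LOW


@dataclass
class CliResult:
    ok: bool
    stdout: str
    stderr: str


def verify_cli(argv: Sequence[str], expected_stdout: str) -> CliResult:
    # Deterministic check: exit code 0 and exact (stripped) stdout match.
    proc = subprocess.run(list(argv), capture_output=True, text=True, timeout=30)
    ok = proc.returncode == 0 and proc.stdout.strip() == expected_stdout
    return CliResult(ok=ok, stdout=proc.stdout, stderr=proc.stderr)


if __name__ == "__main__":
    print(classify("browser"), needs_explicit_verification("browser"))
    print(verify_cli([sys.executable, "-c", "print('hi')"], "hi").ok)
```

Defaulting unknown action types to the low tier is a design choice in this sketch: it errs toward adding a verification step rather than silently trusting an unclassified action.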
Journey Context:
A common mistake in agent eval design is treating browser automation and CLI automation as equally evaluable. SWE-bench works because it verifies via test suites, which are deterministic and therefore highly verifiable. WebArena-style browser evals are harder to score because browser state is hard to verify deterministically: "the button was clicked" is a low-verifiability assertion unless it is tied to an observable state change. When designing agent systems, prefer actions with deterministic verification. When you cannot avoid low-verifiability actions, add a compensating verification layer; otherwise the evals will flake and agent regressions will go undetected.
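One possible shape for such a compensating verification layer, as a sketch only. It assumes the harness can capture the page HTML after the action and can supply screenshots as equal-sized grids of grayscale values (0 to 255). The function names, the `aria-pressed` example, and the default tolerances are illustrative assumptions, not values from the report.

```python
from html.parser import HTMLParser
from typing import Dict, Optional, Sequence


class _AttrsById(HTMLParser):
    """Collects the attributes of the first element with a given id."""

    def __init__(self, element_id: str) -> None:
        super().__init__()
        self.element_id = element_id
        self.attrs: Optional[Dict[str, Optional[str]]] = None

    def handle_starttag(self, tag, attrs):
        found = dict(attrs)
        if self.attrs is None and found.get("id") == self.element_id:
            self.attrs = found


def dom_attr_equals(html: str, element_id: str, attr: str, expected: str) -> bool:
    # DOM state assertion: the "click" counts only if the captured DOM
    # shows the expected attribute value on the target element.
    parser = _AttrsById(element_id)
    parser.feed(html)
    return parser.attrs is not None and parser.attrs.get(attr) == expected


def changed_fraction(
    baseline: Sequence[Sequence[int]],
    candidate: Sequence[Sequence[int]],
    per_pixel_tol: int = 8,
) -> float:
    # Fraction of pixels whose grayscale value differs by more than the tolerance.
    if len(baseline) != len(candidate) or any(
        len(a) != len(b) for a, b in zip(baseline, candidate)
    ):
        raise ValueError("screenshots must have identical dimensions")
    total = sum(len(row) for row in baseline)
    if total == 0:
        raise ValueError("screenshots must not be empty")
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > per_pixel_tol
    )
    return changed / total


def screenshots_match(
    baseline: Sequence[Sequence[int]],
    candidate: Sequence[Sequence[int]],
    max_changed: float = 0.01,
    per_pixel_tol: int = 8,
) -> bool:
    # Screenshot diff with tolerance: small rendering noise is accepted,
    # larger visual changes fail the check.
    return changed_fraction(baseline, candidate, per_pixel_tol) <= max_changed
```

For example, after an agent reports clicking a toggle with id `save`, the eval would call `dom_attr_equals(captured_html, "save", "aria-pressed", "true")` instead of trusting the reported click. The two tolerances trade flakiness against sensitivity and would need tuning per page.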
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:38:25.988374+00:00— report_created — created