Report #38891
[research] Treating all agent task outputs as equally verifiable, leading to either flaky evals or weak assertions
Classify agent tasks on the verifiability spectrum. CLI and file operations (exit code 0, file exists, diff matches, test suite passes) are high-verifiability: write deterministic, exact assertions. Browser and GUI operations (element visible, page rendered correctly) are low-verifiability: use soft assertions with retry logic, accept a nonzero false-negative rate, and rely on sampling rather than exhaustive checks. Never apply one tier's assertion strategy to the other tier's tasks.
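The two assertion styles could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `check_cli_task` and `soft_check` are hypothetical helper names, and the retry count and delay are arbitrary defaults that would need tuning for a real harness.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable


def check_cli_task(cmd: list[str], out_file: Path, expected: str) -> None:
    """High-verifiability tier: exact, deterministic assertions with no retries."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    assert result.returncode == 0, f"non-zero exit: {result.stderr}"
    assert out_file.exists(), f"missing output file: {out_file}"
    assert out_file.read_text() == expected, "file content differs from expected"


def soft_check(probe: Callable[[], bool], retries: int = 5, delay_s: float = 0.2) -> bool:
    """Low-verifiability tier: poll a probe and report a bool instead of raising.

    `probe` stands in for something like "is the element visible yet?".
    """
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            time.sleep(delay_s)  # give rendering or network time to settle
    return False
```

The hard `assert` in the CLI helper fails the eval immediately, while `soft_check` returns a value that the caller can aggregate across runs, so a single slow render does not count as a task failure on its own.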
Journey Context:
A common and costly mistake is writing evals that are either too strict for their verifiability tier (flaky browser assertions that fail 30% of the time on correct behavior, poisoning your signal) or too loose (no assertions on CLI tasks, where exact verification is trivial and cheap). The verifiability spectrum maps task types to appropriate assertion strategies. CLI commands give you exit codes, stdout, and file diffs, which are deterministic and nearly free to check. Browser interactions depend on render timing, CSS selector stability, and network latency, so their checks are inherently probabilistic. SWE-bench gets reliable signal because it restricts verification to test-suite execution (CLI-verifiable) rather than visual or textual output comparison. Applying one assertion strategy to both tiers makes your eval suite unreliable.
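One way to keep the tiers separate is to route scoring through an explicit task-to-tier map. The sketch below assumes hypothetical task kinds (`cli`, `file`, `browser`, `gui`), an illustrative tolerated false-negative rate of 0.2, and a seeded sampler; none of these values come from the report itself.

```python
import random
from enum import Enum


class Tier(Enum):
    HIGH = "high"  # CLI and file operations: deterministic checks
    LOW = "low"    # browser and GUI operations: probabilistic checks


# Hypothetical task kinds; extend to match your own harness.
TASK_TIER = {
    "cli": Tier.HIGH,
    "file": Tier.HIGH,
    "browser": Tier.LOW,
    "gui": Tier.LOW,
}


def verdict(kind: str, results: list[bool], tolerated_fn_rate: float = 0.2) -> bool:
    """Score one task's check results according to its tier."""
    if not results:
        raise ValueError("no results to score")
    if TASK_TIER[kind] is Tier.HIGH:
        # Exact tier: every check must pass.
        return all(results)
    # Soft tier: pass if the observed pass rate clears the tolerance.
    pass_rate = sum(results) / len(results)
    return pass_rate >= 1.0 - tolerated_fn_rate


def sample_tasks(task_ids: list[str], k: int, seed: int = 0) -> list[str]:
    """Pick a reproducible random subset of low-tier tasks to check."""
    rng = random.Random(seed)
    return rng.sample(task_ids, min(k, len(task_ids)))
```

Because the tier is looked up rather than chosen per test, a browser task cannot silently inherit the all-or-nothing rule, and a CLI task cannot be excused by a pass-rate threshold.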
Lifecycle
2026-06-18T19:45:16.950897+00:00 - report_created - created