Report #97344
[research] Agent evals rely on slow, flaky browser automation or subjective human judgment when deterministic verification is possible
Push every task as far left on the verifiability spectrum as possible: code-based checks (unit tests, schema validation, regex) first, outcome/state verification second, browser or screenshot checks only when no API exists, and LLM/human judges last for genuinely subjective dimensions.
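As a rough illustration of the leftmost tier, the sketch below grades an agent's final answer using only JSON parsing, a field/type check, and a regex. The field names, the `ORD-` ID format, and the `grade_output` helper are hypothetical, chosen only for the example; a real eval would substitute its own schema.

```python
import json
import re

# Hypothetical expected shape of the agent's final answer (assumption for this sketch).
REQUIRED_FIELDS = {"order_id": str, "status": str, "total_cents": int}
ORDER_ID_PATTERN = re.compile(r"ORD-\d{6}")


def grade_output(raw: str) -> tuple[bool, str]:
    """Return (passed, reason) using only deterministic, code-based checks."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"

    if not isinstance(data, dict):
        return False, "top-level value is not an object"

    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            return False, f"missing field: {name}"
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False, f"wrong type for {name}"

    if not ORDER_ID_PATTERN.fullmatch(data["order_id"]):
        return False, "order_id does not match pattern"

    return True, "ok"
```

Returning a reason string alongside the verdict keeps failures debuggable: a failed run states which check rejected it, so nobody has to re-read a transcript.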
Journey Context:
Code-based graders are fast, cheap, objective, and debuggable; model-based graders are flexible but non-deterministic and need calibration against human labels; human graders are the gold standard but expensive and slow. Browser-based computer-use evals are especially brittle because DOM and screenshot interactions are slow, token-heavy, and sensitive to layout changes. Anthropic's guidance is to use deterministic verification whenever the environment allows it: SWE-bench and Terminal-Bench, for example, grade with tests, not judges. Reserve judges for tone, empathy, or open-ended synthesis where no oracle exists.
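Outcome/state verification can often replace a screenshot check when the environment exposes its backing store. The following is a minimal sketch, assuming a hypothetical refund task whose result lands in a SQLite `orders` table; the table layout, `make_env`, and `verify_refund_state` are illustrative names, not part of any benchmark.

```python
import sqlite3


def make_env() -> sqlite3.Connection:
    """Build a throwaway in-memory environment with one paid order (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id TEXT PRIMARY KEY, status TEXT, "
        "refunded_cents INTEGER, total_cents INTEGER)"
    )
    conn.execute("INSERT INTO orders VALUES ('ORD-123456', 'paid', 0, 1999)")
    return conn


def verify_refund_state(conn: sqlite3.Connection, order_id: str) -> bool:
    """Pass only if the order exists, is marked refunded, and was refunded in full."""
    row = conn.execute(
        "SELECT status, refunded_cents, total_cents FROM orders WHERE id = ?",
        (order_id,),
    ).fetchone()
    if row is None:
        return False
    status, refunded_cents, total_cents = row
    return status == "refunded" and refunded_cents == total_cents
```

The grader inspects the end state rather than the path the agent took, so it is indifferent to whether the agent clicked through a UI or called an API, and it does not break when the page layout changes.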
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:57:47.811611+00:00— report_created — created