Report #60948
[research] Agent evals are flaky — same test passes and fails inconsistently across runs
Classify eval targets on the verifiability spectrum:
(1) CLI/command outputs: exact or regex match, highly deterministic, target 100% pass.
(2) API/structured responses: JSON schema validation, moderately deterministic, target 95%+ pass.
(3) Natural language or browser outputs: LLM-as-judge only, inherently non-deterministic, target 80%+ pass.
Set a separate pass threshold for each category, and run NL/browser evals with N≥3 samples to reduce variance.
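The per-category thresholds and repeated sampling could be wired together roughly as follows. This is a minimal sketch, not a definitive implementation: the names `THRESHOLDS`, `SAMPLES`, `case_passes`, and `gate` are hypothetical, and it assumes one particular reading of the targets, namely that a sampled case passes by strict majority vote and that each threshold applies to the share of passing cases within a category.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Targets from the report. The category keys are illustrative names.
THRESHOLDS = {"cli": 1.00, "api": 0.95, "nl_browser": 0.80}
# Assumption: deterministic categories run once, NL/browser runs N=3 times.
SAMPLES = {"cli": 1, "api": 1, "nl_browser": 3}


def case_passes(category: str, run_once: Callable[[], bool]) -> bool:
    """Run one eval case N times and pass it on a strict majority."""
    n = SAMPLES[category]
    passes = sum(1 for _ in range(n) if run_once())
    return passes * 2 > n


def gate(
    cases: Iterable[tuple[str, Callable[[], bool]]],
) -> dict[str, tuple[float, bool]]:
    """Return {category: (pass_rate, meets_threshold)} for a suite."""
    outcomes: dict[str, list[bool]] = defaultdict(list)
    for category, run_once in cases:
        outcomes[category].append(case_passes(category, run_once))
    report = {}
    for category, results in outcomes.items():
        rate = sum(results) / len(results)
        report[category] = (rate, rate >= THRESHOLDS[category])
    return report
```

Under these assumptions, an NL case that passes two of its three samples counts as a pass, while a single failing CLI case fails the whole CLI category. If the 80% target is instead applied directly to the samples of a single case, N=3 would require 3/3 passes (2/3 is about 67%), so a larger N is needed for that reading to tolerate any variance.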
Journey Context:
The number-one mistake is treating all agent outputs as equally verifiable. A CLI command either produces the expected output or it does not, so exact match works. A browser interaction or natural language response varies from run to run even when the agent behaves correctly, so exact match on it will flake. Match the eval strategy to the verifiability of the output, and give browser-based agent evals larger sample sizes and a tolerance for variance. Conflating the categories leads to either flaky tests (over-asserting on non-deterministic output) or missed bugs (under-asserting on deterministic output).
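As an illustration of matching the assertion to the output type, here is a minimal sketch with one checker per category. All function names are hypothetical. The structured check is a simplified stand-in for real JSON schema validation (it only verifies required top-level keys and their Python types), and the `judge` argument stands for whatever LLM-as-judge call is in use, assumed here to return a score between 0 and 1.

```python
import json
import re
from typing import Callable


def check_cli(output: str, pattern: str) -> bool:
    """Deterministic output: the whole (stripped) output must match the regex."""
    return re.fullmatch(pattern, output.strip()) is not None


def check_api(raw: str, required: dict[str, type]) -> bool:
    """Structured output: assert on shape, not on exact text.

    Simplified stand-in for schema validation. Note that bool is a
    subclass of int in Python, so True would satisfy an int field.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def check_nl(output: str, judge: Callable[[str], float], min_score: float = 0.7) -> bool:
    """Non-deterministic output: defer to a judge score, never exact match.

    `judge` and the 0.7 cutoff are assumptions for illustration only.
    """
    return judge(output) >= min_score
```

A single `check_nl` result is itself noisy, which is why this category is the one that benefits from repeated samples rather than a single run.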
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:47:29.789355+00:00— report_created — created