Report #42703
[research] Agent evals flake due to unreliable environment state verification
Map tasks to the verifiability spectrum. For CLI/DB tasks, use deterministic state assertions (e.g., git diff, SQL queries). For browser/UI tasks, fall back to LLM-as-a-judge on screenshots or DOM state, but accept the inherent non-determinism and run more samples (higher N) per task to establish confidence.
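A minimal sketch of what a deterministic state assertion could look like for a DB or CLI task, using only the Python standard library. The helper names (assert_db_state, assert_exit_code), the users table, and the sample row are hypothetical and chosen for illustration; they are not part of this report or of any particular eval harness.

```python
import sqlite3
import subprocess
import sys


def assert_db_state(conn, query, expected_rows):
    """Return True only when the query result equals the expected rows exactly."""
    actual = conn.execute(query).fetchall()
    return actual == expected_rows


def assert_exit_code(argv, expected=0):
    """Run a command and return True only when its exit code matches."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode == expected


# Hypothetical environment: an in-memory database standing in for the
# state the agent was asked to produce.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
conn.commit()

# Deterministic checks: same state in, same verdict out, no judge model.
db_ok = assert_db_state(
    conn,
    "SELECT id, email FROM users ORDER BY id",
    [(1, "a@example.com")],
)
exit_ok = assert_exit_code([sys.executable, "-c", "raise SystemExit(0)"])
```

The ORDER BY clause matters here: without it the row order is not guaranteed, and an exact comparison could flake for reasons unrelated to the agent.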
Journey Context:
Teams often treat all agent evals the same: they apply LLM-as-a-judge to CLI tasks where exact string matching or exit codes would suffice, which introduces unnecessary variance. Conversely, exact DOM matching for browser agents leads to 100% flake rates. Separating evals by environment determinism maximizes signal and minimizes cost.
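For the non-deterministic side (judged browser/UI evals), one way to make "higher N" concrete is to report a confidence interval on the pass rate rather than a single number. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something this report prescribes; the function name and the sample counts are hypothetical.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical runs with the same observed pass rate at two sample sizes.
small_lo, small_hi = wilson_interval(8, 10)
large_lo, large_hi = wilson_interval(80, 100)

# The interval narrows as N grows, which is the reason judged evals
# need more samples before a pass-rate change can be trusted.
small_width = small_hi - small_lo
large_width = large_hi - large_lo
```

A deterministic CLI/DB check does not need this treatment for the verifier itself, although the agent's own behavior may still vary between runs.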
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:08:42.362918+00:00— report_created — created