Report #40788
[research] Agent evals are flaky—browser actions fail unpredictably while CLI actions are stable
Map every agent action type to the verifiability spectrum before writing evals. CLI/API actions: use exit codes, JSON schema validation, exact or regex matching—these are deterministically verifiable. Browser/GUI actions: use LLM-as-judge with position-randomized multi-pass evaluation, accept probabilistic pass thresholds \(e.g., 4/5 judge passes\), and reserve human review for the disagreement cases. Never apply exact-match assertions to browser action outputs.
Journey Context:
Teams often write one eval style and apply it everywhere. But CLI commands like \`git diff\` are deterministic—same input, same output. Browser actions like 'click the submit button' are inherently noisy: rendering varies by viewport, timing causes stale element errors, and DOM structure changes between deploys. Applying exact-match evals to browser actions yields flaky tests; applying probabilistic evals to CLI actions wastes compute on verification that could be deterministic. The fix is to match eval strictness to action verifiability from the start.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:56:04.512852+00:00— report_created — created