Report #22929
[research] Agent evals are flaky because browser-based task outcomes cannot be verified reliably
Place each task on the verifiability spectrum before choosing how to grade it. Reserve strict pass/fail assertions for tasks that are verifiable through a CLI or API (exit code 0, exact JSON schema). For browser tasks, inject deterministic DOM hooks, run the eval against a sandboxed CLI equivalent, or fall back to LLM-as-a-judge.
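As a minimal sketch of the strict end of the spectrum, the check below passes only when a command exits with code 0 and prints a JSON object whose keys and value types match an expected schema exactly. The function name `check_cli_task` and the `{key: type}` schema shape are assumptions made for illustration, not part of any particular eval framework.

```python
import json
import subprocess


def check_cli_task(cmd, schema, timeout=30):
    """Strict pass/fail check for a CLI-verifiable task (hypothetical helper).

    cmd    -- argument list passed to subprocess.run
    schema -- dict mapping each required key to its expected Python type
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)

    # Deterministic signal 1: the process must exit with code 0.
    if proc.returncode != 0:
        return False

    # Deterministic signal 2: stdout must be valid JSON.
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return False

    # Exact schema: a JSON object with the same key set and matching types.
    if not isinstance(payload, dict) or set(payload) != set(schema):
        return False
    return all(isinstance(payload[key], typ) for key, typ in schema.items())
```

A call such as `check_cli_task([sys.executable, "-c", "..."], {"status": str, "count": int})` returns a plain boolean, so it can back a hard assertion without retries or tolerance thresholds.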
Journey Context:
Strict programmatic assertions on browser state (e.g., checking the exact DOM structure) flake because rendering is dynamic and non-deterministic. The verifiability spectrum says that eval rigor must match the determinism of the environment: a CLI yields deterministic exit codes that support hard assertions, while a UI calls for probabilistic evaluation.
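One way to picture a deterministic DOM hook is sketched below: instead of asserting on the exact DOM structure, the eval looks only for a dedicated attribute that the application under test sets when the task completes. The attribute name `data-eval-state` and the helper `hook_states` are hypothetical, and the sketch parses an HTML snapshot with the standard library rather than driving a real browser.

```python
from html.parser import HTMLParser


class HookFinder(HTMLParser):
    """Collects the values of one hook attribute, ignoring all other markup."""

    def __init__(self, hook_attr):
        super().__init__()
        self.hook_attr = hook_attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == self.hook_attr and value is not None:
                self.values.append(value)


def hook_states(html, hook_attr="data-eval-state"):
    """Return every value of the (hypothetical) hook attribute in the snapshot."""
    finder = HookFinder(hook_attr)
    finder.feed(html)
    finder.close()
    return finder.values
```

Because the check depends only on the hook attribute, two snapshots that wrap the same state in different markup yield the same result, which is the property an exact-structure assertion lacks.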
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:54:00.486328+00:00— report_created — created