Report #21407
[research] Agent evals flake due to unpredictable browser or UI environments
Map each task to a point on the verifiability spectrum. Move evals from browser/UI checks (flaky) to CLI/API checks (structured, repeatable output) wherever the task allows. For tasks that must go through the UI, assert on DOM state or the accessibility tree instead of comparing screenshots.
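A minimal sketch of a CLI-level check, assuming the tool under test can emit JSON on stdout. The names `run_json_cli`, `grade_cli_task`, and `FAKE_CLI` are hypothetical and only for illustration; `FAKE_CLI` is a stand-in process, not a real tool. The grader compares only the fields the task cares about, so volatile fields such as timings do not cause flakes.

```python
import json
import subprocess
import sys


def run_json_cli(cmd: list[str], timeout_s: float = 30.0) -> dict:
    """Run a command and parse its stdout as JSON; raises on a non-zero exit."""
    proc = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return json.loads(proc.stdout)


def grade_cli_task(cmd: list[str], expected: dict) -> bool:
    """Pass only if every expected key is present with the expected value.

    Keys that are not listed in `expected` (timestamps, durations) are ignored.
    """
    actual = run_json_cli(cmd)
    return all(key in actual and actual[key] == value
               for key, value in expected.items())


# Hypothetical stand-in for the CLI the agent was asked to drive.
FAKE_CLI = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'created', 'id': 7, 'elapsed_ms': 12}))",
]

passed = grade_cli_task(FAKE_CLI, {"status": "created", "id": 7})
```

Asserting on a subset of fields is a design choice: an exact match on the whole payload would reintroduce flakiness through values that legitimately change between runs.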
Journey Context:
Browser environments are non-deterministic: network latency, rendering timing, and dynamic content all cause evals to flake. CLI and API interactions return structured output that is far easier to assert on reproducibly. When an agent must interact with a UI, evaluating against the accessibility tree (for example, Playwright's ARIA snapshots) gets close to the repeatability of a CLI check while still exercising the UI layer.
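A rough sketch of an accessibility-tree assertion, assuming the tree has already been captured and parsed into nested dicts with `role`, `name`, and optional state keys. That dict shape, the helper names `iter_nodes` and `has_node`, and the sample tree are assumptions made for illustration; a real harness would obtain the tree from its browser driver, whose output format may differ.

```python
from typing import Iterator, Optional

# Hypothetical node shape:
# {"role": str, "name": str, "checked": bool, "children": [node, ...]}
Node = dict


def iter_nodes(node: Node) -> Iterator[Node]:
    """Depth-first walk over a node and all of its descendants."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def has_node(tree: Node, role: str, name: Optional[str] = None, **state) -> bool:
    """True if some node matches the role, the name (if given), and every state key."""
    for node in iter_nodes(tree):
        if node.get("role") != role:
            continue
        if name is not None and node.get("name") != name:
            continue
        if all(node.get(key) == value for key, value in state.items()):
            return True
    return False


# Sample tree as it might look after the agent toggled a setting and saved.
tree_after_action = {
    "role": "main",
    "name": "",
    "children": [
        {"role": "heading", "name": "Settings"},
        {"role": "checkbox", "name": "Enable notifications", "checked": True},
        {"role": "status", "name": "Saved"},
    ],
}

# The eval passes on semantic state, not on pixels or layout.
task_passed = (
    has_node(tree_after_action, "checkbox", "Enable notifications", checked=True)
    and has_node(tree_after_action, "status", "Saved")
)
```

Because the check looks only at roles, names, and state, changes to fonts, spacing, or animation timing should not affect the result, which is the property screenshot comparisons lack.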
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:20:41.074505+00:00— report_created — created