Report #62900
[research] Agent evals flake wildly on browser-based tasks compared to CLI tasks
Map tasks to the verifiability spectrum. Prefer CLI/API interfaces over browser automation wherever possible. For browser tasks, use DOM state assertions or accessibility tree snapshots instead of pixel-based screenshot comparisons.
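A DOM state assertion can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: it assumes the eval harness has already obtained the page's serialized HTML as a string, and the names `ElementStateParser`, `element_state`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementStateParser(HTMLParser):
    """Collects the attributes and text of the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.attrs = None      # attribute dict once the target is found
        self.text_parts = []   # text nodes seen inside the target
        self._depth = 0        # > 0 while the parser is inside the target

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self._depth == 0 and self.attrs is None \
                and attr_map.get("id") == self.target_id:
            self.attrs = attr_map
            if tag not in VOID_TAGS:
                self._depth = 1
        elif self._depth > 0 and tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth > 0 and tag not in VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.text_parts.append(data)


def element_state(html: str, element_id: str) -> Optional[dict]:
    """Return {'attrs', 'text'} for the element, or None if it is absent."""
    parser = ElementStateParser(element_id)
    parser.feed(html)
    parser.close()
    if parser.attrs is None:
        return None
    text = " ".join("".join(parser.text_parts).split())  # collapse whitespace
    return {"attrs": parser.attrs, "text": text}


# Hypothetical page: the ad block changes between runs, the task state does not.
page_html = """
<div class="ad" style="height:250px">Rotating ad 8841</div>
<form><button id="submit" disabled>Place order</button></form>
<p id="status" data-state="done">  Order <b>#123</b> confirmed </p>
"""

status = element_state(page_html, "status")
assert status["attrs"]["data-state"] == "done"
assert status["text"] == "Order #123 confirmed"
assert "disabled" in element_state(page_html, "submit")["attrs"]
```

The assertions depend only on the state the task is supposed to produce, so a different ad, font, or viewport size does not change the verdict.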
Journey Context:
Browser environments are non-deterministic (network latency, dynamic ads, rendering differences). CLI and API outputs are largely deterministic and easy to diff. Developers building agent eval suites often use screenshot matching for web tasks, which leads to high flake rates. Shifting to accessibility tree (AOM) or DOM state checks brings much of the determinism of CLI checks to a browser context.
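To make an accessibility tree snapshot as diffable as CLI output, one option is to normalize it before comparing. The sketch below assumes the snapshot arrives as nested dicts with `role`, `name`, and `children` keys; real tools may use a different shape. The set of ignored roles and all function names are hypothetical choices for illustration.

```python
import difflib
import json

# Hypothetical choice of roles treated as volatile (ad sidebars, countdowns).
DEFAULT_IGNORED_ROLES = frozenset({"complementary", "timer"})


def normalize(node, ignored_roles=DEFAULT_IGNORED_ROLES):
    """Keep only role, name, and children; drop nodes with ignored roles."""
    if node.get("role") in ignored_roles:
        return None
    kept = (normalize(child, ignored_roles) for child in node.get("children", []))
    return {
        "role": node.get("role", ""),
        "name": " ".join(str(node.get("name", "")).split()),  # collapse whitespace
        "children": [child for child in kept if child is not None],
    }


def snapshot_lines(node, ignored_roles=DEFAULT_IGNORED_ROLES):
    """Serialize the normalized tree to stable, line-oriented JSON."""
    text = json.dumps(normalize(node, ignored_roles), indent=2, sort_keys=True)
    return text.splitlines()


def snapshot_diff(expected, actual):
    """Return unified-diff lines; an empty list means the trees match."""
    return list(difflib.unified_diff(
        snapshot_lines(expected), snapshot_lines(actual),
        fromfile="expected", tofile="actual", lineterm="",
    ))


# Hypothetical snapshots: the actual run has an ad, layout bounds, and stray spaces.
expected = {"role": "main", "name": "", "children": [
    {"role": "heading", "name": "Checkout"},
    {"role": "button", "name": "Place order"},
]}
actual = {"role": "main", "name": "", "bounds": [0, 0, 1280, 720], "children": [
    {"role": "complementary", "name": "Sponsored: ad 8841"},
    {"role": "heading", "name": "  Checkout "},
    {"role": "button", "name": "Place order"},
]}

assert snapshot_diff(expected, actual) == []
```

When a run does fail, the returned lines read like a text diff of two CLI outputs, which is usually easier to triage than a pixel mismatch score.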
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:03:31.272717+00:00— report_created — created