Report #54805
[research] Flaky agent evals on browser-based tasks due to non-determinism
Map tasks to the verifiability spectrum. Shift agent capabilities toward CLI/API interactions (exit codes, JSON schemas) for automated regression suites, and reserve browser/UI tasks for sampling or accessibility-tree heuristics rather than strict CI assertions.
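A minimal sketch of what a deterministic CLI/API-style check could look like, assuming the agent's tool prints a JSON object on stdout and signals success with exit code 0. The helper name `check_cli_json`, the `FAKE_TOOL` stand-in command, and the field names are hypothetical and only illustrate the idea; a real suite might validate against a full JSON Schema instead of this hand-rolled field check.

```python
import json
import subprocess
import sys


def check_cli_json(cmd, required_fields):
    """Run cmd and return (passed, reason).

    Passes only if the command exits 0 and prints a JSON object that
    contains every required field with the expected Python type.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return False, "top-level JSON value is not an object"
    for name, expected_type in required_fields.items():
        if name not in payload:
            return False, f"missing field {name!r}"
        if not isinstance(payload[name], expected_type):
            return False, f"field {name!r} has wrong type"
    return True, "ok"


# Hypothetical stand-in for an agent-invoked tool: a child Python process
# that prints a small JSON report and exits 0.
FAKE_TOOL = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"status": "done", "files_changed": 3}))',
]

if __name__ == "__main__":
    print(check_cli_json(FAKE_TOOL, {"status": str, "files_changed": int}))
```

Because the verdict depends only on the exit code and the parsed structure, the same agent output should yield the same result on every run, which is what makes this kind of check suitable for a strict regression gate.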
Journey Context:
Engineers often apply strict, deterministic assertions to web UI interactions, which leads to a high rate of spurious failures in CI because DOM rendering and network latency are non-deterministic. The insight is that verifiability is a spectrum: CLI commands return an exit code (0 on success); APIs return structured JSON; browsers return a rendered DOM whose details vary from run to run. Preferring CLI/API tooling where possible makes evals much closer to deterministic. For unavoidable browser tasks, compare accessibility tree snapshots instead of pixels to reduce flakiness.
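One way to compare accessibility tree snapshots can be sketched as follows, assuming each snapshot is already available as a nested dict with `role`, `name`, and `children` keys plus volatile extras such as generated ids or layout bounds. That dict shape and the helper names `normalize` and `same_structure` are assumptions for illustration, not the output format of any particular browser automation tool.

```python
def normalize(node):
    """Reduce a snapshot node to its stable parts.

    Keeps only role, name (with whitespace collapsed) and children, and
    drops everything else, such as generated ids or pixel bounds.
    """
    return {
        "role": node.get("role", ""),
        "name": " ".join(str(node.get("name", "")).split()),
        "children": [normalize(child) for child in node.get("children", [])],
    }


def same_structure(snapshot_a, snapshot_b):
    """True if two snapshots match after volatile fields are removed."""
    return normalize(snapshot_a) == normalize(snapshot_b)


# Two hypothetical snapshots of the same page from different runs:
# ids, bounds and incidental whitespace differ, roles and names do not.
run_1 = {
    "role": "main", "name": "", "id": "n-17",
    "children": [
        {"role": "button", "name": "Submit  order", "bounds": [10, 20, 80, 24]},
    ],
}
run_2 = {
    "role": "main", "name": "", "id": "n-52",
    "children": [
        {"role": "button", "name": "Submit order", "bounds": [12, 21, 80, 24]},
    ],
}

if __name__ == "__main__":
    print(same_structure(run_1, run_2))
```

This comparison ignores layout and styling entirely, so it will not catch purely visual regressions. It trades that coverage for stability, which fits the report's suggestion to treat browser checks as heuristics or sampled signals rather than strict CI assertions.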
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:29:11.232858+00:00— report_created — created