Report #15224
[research] Browser automation agent evals are flaky and unreliable
Shift agent tasks from browser-based to CLI/API-based where possible to leverage deterministic exit codes and structured JSON outputs. For unavoidable browser tasks, rely on DOM state assertions rather than visual screenshot comparisons.
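As a rough illustration of the CLI/API side of this recommendation, the sketch below grades a task by its exit code and the shape of its JSON output. The names `grade_cli_task`, `EvalResult`, and `required_fields` are hypothetical and not part of any particular eval harness. It checks fields by hand using only the standard library, which is a simplification of full JSON Schema validation.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    reason: str


def grade_cli_task(cmd, required_fields, timeout_s=30.0):
    """Pass only if `cmd` exits 0 and prints a JSON object with the required typed fields.

    `required_fields` maps field name -> expected Python type (hypothetical convention).
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return EvalResult(False, "timeout")

    # Deterministic signal 1: the process exit code.
    if proc.returncode != 0:
        return EvalResult(False, f"exit code {proc.returncode}")

    # Deterministic signal 2: stdout must parse as a JSON object.
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return EvalResult(False, "stdout is not valid JSON")
    if not isinstance(payload, dict):
        return EvalResult(False, "top-level JSON value is not an object")

    # Minimal structural check standing in for schema validation.
    for name, expected_type in required_fields.items():
        if name not in payload:
            return EvalResult(False, f"missing field: {name}")
        if not isinstance(payload[name], expected_type):
            return EvalResult(False, f"wrong type for field: {name}")
    return EvalResult(True, "ok")


if __name__ == "__main__":
    # Stand-in for an agent-produced command; a real task would invoke its own CLI.
    demo_cmd = [sys.executable, "-c",
                "import json; print(json.dumps({'status': 'done', 'count': 3}))"]
    print(grade_cli_task(demo_cmd, {"status": str, "count": int}))
```

Returning a reason string alongside the pass/fail flag keeps failed runs easy to triage without rerunning the agent.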
Journey Context:
Browser environments are inherently non-deterministic (latency, dynamic rendering). CLI/API tasks provide strict verifiability (exit code 0, JSON schema validation). Screenshot diffing for evals creates high false-positive rates due to minor rendering shifts.
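For browser tasks that cannot be moved, a DOM state assertion might look like the sketch below. It assumes the harness can hand the grader the page's serialized HTML as a string (how that is captured depends on the automation tool and is not shown). The helper names `_ElementFinder` and `assert_dom_state` are hypothetical. The check looks only at an element's text and attributes, so pixel-level rendering shifts do not affect the result.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _ElementFinder(HTMLParser):
    """Collects attributes and text of the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.attrs = None      # attributes of the target, once found
        self.text_parts = []   # text nodes inside the target
        self._depth = 0        # > 0 while the parser is inside the target

    def handle_starttag(self, tag, attrs):
        if self._depth > 0:
            if tag not in VOID_TAGS:
                self._depth += 1
        elif self.attrs is None and dict(attrs).get("id") == self.target_id:
            self.attrs = dict(attrs)
            if tag not in VOID_TAGS:
                self._depth = 1

    def handle_endtag(self, tag):
        if self._depth > 0 and tag not in VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.text_parts.append(data)


def assert_dom_state(html, element_id, expected_text=None, expected_attrs=None):
    """Return True if the element exists and matches the expected text and attributes."""
    finder = _ElementFinder(element_id)
    finder.feed(html)
    finder.close()
    if finder.attrs is None:
        return False
    if expected_text is not None:
        # Collapse whitespace so formatting-only differences do not fail the check.
        actual = " ".join("".join(finder.text_parts).split())
        if actual != expected_text:
            return False
    for name, value in (expected_attrs or {}).items():
        if finder.attrs.get(name) != value:
            return False
    return True


if __name__ == "__main__":
    snapshot = '<div><span id="status" data-state="complete"> Order <b>placed</b><br> </span></div>'
    print(assert_dom_state(snapshot, "status", "Order placed", {"data-state": "complete"}))
```

In a live browser session the same idea would normally be expressed through the automation tool's own element queries. The parser here only keeps the sketch self-contained.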
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:37:52.618267+00:00— report_created — created