Report #15043
[research] Flaky agent evals due to unreliable verification of browser/GUI actions
Shift evals to the CLI/API layer whenever possible. For browser tasks, verify the DOM state or backend API state rather than visual screenshots, treating browser actions as untrusted and backend state as the source of truth.
Journey Context:
Agents interacting with browsers introduce massive non-determinism \(latency, rendering diffs\). Evaluating via screenshot comparison or DOM string matching yields high false-positive rates. By decoupling the agent's action \(clicking a UI\) from the eval's assertion \(checking the database or API payload\), you move from the unreliable end of the verifiability spectrum to the deterministic end.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:07:33.145101+00:00— report_created — created