Report #16745
[research] Agent evals are flaky when interacting with web browsers or dynamic UIs
Shift agent tasks towards the verifiable end of the spectrum (CLI/APIs) where possible. When browser interaction is unavoidable, evaluate the trajectory (action sequence) against a known DOM state or use an LLM-as-a-judge against a screenshot, rather than relying on exact string matching on dynamic content.
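A minimal sketch of the trajectory-plus-DOM-state check described above, in Python. Everything here is an assumption for illustration: the `Action` record, the `"role:name"` target format, and the flat dict used as a DOM snapshot are hypothetical stand-ins for whatever your browser harness actually records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One recorded agent step (hypothetical shape, not a real harness type)."""
    kind: str    # e.g. "navigate", "click", "type"
    target: str  # e.g. "button:Checkout" (assumed role:name format)


def trajectory_matches(actual, expected):
    """Return True if the expected actions occur in order within actual.

    This is a subsequence match: extra steps such as retries or cookie
    banners are tolerated, but the required steps must keep their order.
    """
    remaining = iter(actual)
    # `step in remaining` advances the iterator until it finds the step,
    # so each expected step must appear after the previous one.
    return all(step in remaining for step in expected)


def dom_state_matches(snapshot, required):
    """Return True if every required key has the required value.

    Keys not listed in `required` (timestamps, ads, counters) are ignored,
    which is what keeps dynamic content from failing the eval.
    """
    return all(snapshot.get(key) == value for key, value in required.items())


# Usage: the agent retried once, and the page carries a volatile timestamp.
actual = [
    Action("navigate", "/cart"),
    Action("click", "button:Retry"),
    Action("click", "button:Checkout"),
]
expected = [Action("navigate", "/cart"), Action("click", "button:Checkout")]
snapshot = {"heading": "Order confirmed", "rendered_at": "12:03:41"}

passed = trajectory_matches(actual, expected) and dom_state_matches(
    snapshot, {"heading": "Order confirmed"}
)
```

The subsequence match is a deliberate choice: it checks that the required steps happened in order without penalising harmless extra actions, which exact sequence equality would.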
Journey Context:
A common mistake is treating browser automation like a deterministic CLI. Web content changes, latency varies, and selectors break, so regression tests fail for reasons that have nothing to do with the agent. Mapping tasks onto a verifiability spectrum, where CLI/APIs are highly verifiable (exit codes, JSON schemas) and browsers are weakly verifiable, lets you design each eval to match what can actually be checked. For browser tasks, either accept probabilistic evals or restrict the agent to accessibility trees rather than pixel coordinates to reduce flakiness.
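The two ends of the spectrum can be sketched side by side in Python. This is an illustration under stated assumptions, not a reference implementation: `check_cli`, `pass_rate`, and `browser_eval_passes` are hypothetical helper names, the required-keys check is a simplified stand-in for real JSON schema validation, and the 10-trial / 0.8 threshold defaults are arbitrary.

```python
import json
import subprocess
import sys


def check_cli(cmd, required_keys):
    """Deterministic check: exit code 0 and a JSON object with the given keys."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return False
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return False
    # Simplified stand-in for schema validation: only key presence is checked.
    return isinstance(payload, dict) and set(required_keys).issubset(payload)


def pass_rate(run_once, trials):
    """Fraction of trials in which run_once() returned a truthy result."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if run_once())
    return passes / trials


def browser_eval_passes(run_once, trials=10, threshold=0.8):
    """Probabilistic check: pass if the success rate meets the threshold."""
    return pass_rate(run_once, trials) >= threshold


# Usage (CLI end): a child Python process stands in for the tool under test.
cli_ok = check_cli(
    [sys.executable, "-c", "import json; print(json.dumps({'status': 'ok'}))"],
    {"status"},
)

# Usage (browser end): a stub that fails every fifth run stands in for a
# real browser task; replace it with a call into your own harness.
_calls = {"n": 0}


def flaky_browser_task():
    _calls["n"] += 1
    return _calls["n"] % 5 != 0


browser_ok = browser_eval_passes(flaky_browser_task, trials=10, threshold=0.8)
```

A single run of the CLI check is a meaningful pass or fail, while the browser check only makes sense as a rate over repeated runs, so the threshold should be set from the flakiness you observe rather than from the defaults shown here.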
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T03:38:42.046772+00:00— report_created — created