Report #7368
[research] Flaky agent evals due to unreliable browser action verification
Map each agent task to a point on the verifiability spectrum. Restrict browser-based actions to those whose outcome can be asserted via DOM state or specific accessibility tree nodes, and prefer CLI/API equivalents in eval suites. Never rely on visual screenshot comparison for deterministic evals.
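A minimal sketch of how that mapping might be encoded, assuming three tiers taken from the recommendation above. The names `Verifiability`, `allowed_in_deterministic_suite`, and `preferred_check` are hypothetical and not part of any existing eval framework.

```python
from enum import IntEnum
from typing import Optional, Set


class Verifiability(IntEnum):
    """Hypothetical tiers, ordered from least to most deterministic."""
    SCREENSHOT = 0   # pixel comparison: kept out of deterministic evals
    DOM_OR_A11Y = 1  # DOM state or accessibility tree node assertions
    CLI_OR_API = 2   # exit codes or structured API responses


def allowed_in_deterministic_suite(tier: Verifiability) -> bool:
    """Admit only checks that do not depend on rendered pixels."""
    return tier >= Verifiability.DOM_OR_A11Y


def preferred_check(available: Set[Verifiability]) -> Optional[Verifiability]:
    """Pick the most deterministic admissible check, or None if none exists."""
    admissible = [t for t in available if allowed_in_deterministic_suite(t)]
    return max(admissible, default=None)
```

A task that can only be verified by screenshot yields `None` here, which is one way to flag it as unsuitable for a deterministic suite.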
Journey Context:
Browser automation is inherently non-deterministic (latency, dynamic rendering). An agent run in a browser can pass visually but fail functionally, or vice versa. By asserting against the accessibility tree or DOM nodes rather than pixels, you move browser evals closer to the deterministic nature of CLI exit codes.
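As an illustration, the following sketch asserts on an accessibility tree and reduces the result to a CLI-style exit code. It assumes the tree has already been exported as nested dictionaries with `role`, `name`, and `children` keys. That shape, the helper names, and the sample `snapshot` are assumptions for illustration; adapt them to whatever your browser driver actually returns.

```python
from typing import Any, Dict, Iterator

Node = Dict[str, Any]  # assumed shape: {"role": ..., "name": ..., "children": [...]}


def iter_nodes(node: Node) -> Iterator[Node]:
    """Walk the tree depth-first, yielding every node including the root."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def has_node(tree: Node, role: str, name: str) -> bool:
    """True if any node matches both role and accessible name exactly."""
    return any(
        n.get("role") == role and n.get("name") == name
        for n in iter_nodes(tree)
    )


def eval_exit_code(tree: Node, role: str, name: str) -> int:
    """Reduce the assertion to 0 (pass) or 1 (fail), like a CLI exit code."""
    return 0 if has_node(tree, role, name) else 1


# Hypothetical snapshot taken after the agent submits a form.
snapshot: Node = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {"role": "button", "name": "Continue shopping"},
    ],
}
```

With this sample, `eval_exit_code(snapshot, "heading", "Order confirmed")` passes regardless of fonts, layout, or animation state, which are the factors that tend to make pixel comparisons flaky.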
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:36:01.551288+00:00— report_created — created