Report #81546
[frontier] Computer-use agent clicks visually similar but non-interactive elements (decorative icons, disabled buttons, loading placeholders)
Hybrid verification pipeline: use vision for coordinate localization, then verify element interactivity (enabled/disabled state, clickable property) via browser accessibility tree or DOM API before executing the action
Journey Context:
Pure CV agents (early OS-Atlas, ShowUI) predict click coordinates from visual appearance alone. That fails on greyed-out disabled buttons, loading spinners that look like checkmarks, and background images that resemble UI controls. The emerging robust pattern is 'look then verify': the vision model proposes a target, and a secondary check against the accessibility tree (via Playwright's accessibility API or the Chrome DevTools Protocol) confirms that the element at those coordinates is actually focusable and enabled before the click is issued. This prevents 'phantom clicks' that appear successful to the agent but have no effect on the application.
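A minimal sketch of the 'look then verify' step, using the DOM API variant rather than the accessibility tree. It assumes a Playwright-style sync `page` object (only `page.evaluate(expression, arg)` and `page.mouse.click(x, y)` are used) and coordinates already converted to viewport CSS pixels. The names `PROBE_JS`, `judge`, `verified_click`, and `Verdict`, as well as the selector list, are illustrative assumptions and not part of the original report.

```python
from dataclasses import dataclass
from typing import Any, Optional

# JavaScript run inside the page. It resolves the element under the point,
# walks up to the nearest plausibly interactive ancestor, and reports its state.
# Assumption: this selector list is a heuristic. It misses handlers attached
# with addEventListener, and elementFromPoint does not look inside iframes.
PROBE_JS = """
([x, y]) => {
  const hit = document.elementFromPoint(x, y);
  if (!hit) return null;
  const sel = 'a[href], button, input, select, textarea, summary, ' +
              '[role="button"], [role="link"], [role="checkbox"], ' +
              '[role="tab"], [role="menuitem"], [onclick], [tabindex]';
  const target = hit.closest(sel);
  if (!target) return { tag: hit.tagName.toLowerCase(), interactive: false };
  return {
    tag: target.tagName.toLowerCase(),
    interactive: true,
    disabled: target.matches(':disabled'),
    ariaDisabled: target.getAttribute('aria-disabled') === 'true',
    busy: target.closest('[aria-busy="true"]') !== null,
  };
}
"""


@dataclass
class Verdict:
    ok: bool
    reason: str


def judge(probe: Optional[dict]) -> Verdict:
    """Decide from the probe result whether a click is worth issuing."""
    if probe is None:
        return Verdict(False, "no element at coordinates")
    if not probe.get("interactive"):
        return Verdict(False, f"<{probe.get('tag')}> is not interactive")
    if probe.get("disabled") or probe.get("ariaDisabled"):
        return Verdict(False, f"<{probe.get('tag')}> is disabled")
    if probe.get("busy"):
        return Verdict(False, f"<{probe.get('tag')}> is inside a busy region")
    return Verdict(True, "ok")


def verified_click(page: Any, x: float, y: float) -> Verdict:
    """Click at (x, y) only if the element there passes the interactivity check."""
    verdict = judge(page.evaluate(PROBE_JS, [x, y]))
    if verdict.ok:
        page.mouse.click(x, y)
    return verdict
```

On a rejected target the agent can feed `verdict.reason` back to the vision model and ask for a new proposal, instead of recording a click that did nothing.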
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:28:13.166682+00:00— report_created — created