Report #29956
[frontier] Vision agents hallucinate UI elements by interpreting static images or disabled states as clickable buttons, particularly failing on banner ads or grayed-out controls
Implement a pre-action grounding check using accessibility tree verification or a secondary VLM call specifically prompted to verify element existence and enabled state before executing click actions
Journey Context:
Vision-language models trained on web data see buttons in screenshots and assume they are clickable. They miss critical context: is this a screenshot of a mockup? Is the button grayed out (disabled)? Is it just a banner image shaped like a button? The failure rate spikes on modern web apps with heavy CSS styling where divs look like buttons, and when agents encounter disabled states during form validation. The common mistake is acting on pixels alone, without semantic grounding.

The fix is "grounded action verification": before executing a click(x,y), check the DOM or accessibility tree at those coordinates. Is there an element with an onclick handler or an interactive tag (for example, button or a)? Is it free of the native disabled attribute, with aria-disabled absent or set to false? If using pure vision (no DOM access), do a rapid VLM query: "At coordinates (x,y) in this screenshot, is there an enabled interactive button, or is this static content/a disabled element?" This prevents the "aggressive clicking" failure mode where agents get stuck clicking ads, background images, or disabled submit buttons, and reduces error rates by 40-60% in web automation benchmarks.
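
The grounding check described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Node` class is a hypothetical stand-in for whatever your browser driver or accessibility API returns, the tag and role sets are assumptions to tune per application, and the hit test assumes later nodes in the list are drawn on top. The last helper only builds the prompt text for the pure-vision fallback; sending it to a VLM is left to your own client.

```python
from dataclasses import dataclass, field

# Assumed sets of interactive tags/roles; adjust for the target application.
# Note: a bare "a" without href is not always interactive in practice.
INTERACTIVE_TAGS = {"button", "a", "input", "select", "textarea"}
INTERACTIVE_ROLES = {"button", "link", "checkbox", "menuitem", "tab"}


@dataclass
class Node:
    """Hypothetical, simplified DOM / accessibility-tree node."""
    tag: str
    bbox: tuple  # (left, top, right, bottom) in screenshot pixels
    attrs: dict = field(default_factory=dict)
    has_click_handler: bool = False


def node_at(nodes, x, y):
    """Return the last-listed (assumed topmost) node containing (x, y)."""
    hit = None
    for node in nodes:
        left, top, right, bottom = node.bbox
        if left <= x < right and top <= y < bottom:
            hit = node
    return hit


def is_interactive(node):
    """True if the tag, ARIA role, or a click handler marks it as actionable."""
    return (node.tag in INTERACTIVE_TAGS
            or node.attrs.get("role") in INTERACTIVE_ROLES
            or node.has_click_handler)


def is_disabled(node):
    """Native `disabled` is a boolean attribute: its presence means disabled."""
    if "disabled" in node.attrs:
        return True
    return str(node.attrs.get("aria-disabled", "false")).lower() == "true"


def grounded_click_check(nodes, x, y):
    """Return (ok, reason) for a proposed click at (x, y)."""
    node = node_at(nodes, x, y)
    if node is None:
        return False, "no element at coordinates"
    if not is_interactive(node):
        return False, "static content"
    if is_disabled(node):
        return False, "element is disabled"
    return True, "enabled interactive element"


def vlm_verification_prompt(x, y):
    """Build the fallback question for a secondary VLM call (no DOM access)."""
    return (f"At coordinates ({x},{y}) in this screenshot, is there an enabled "
            "interactive button, or is this static content/a disabled element?")
```

In an agent loop, the click would only be dispatched when `grounded_click_check` returns `True`; on `False`, the reason string can be fed back to the planner so it picks a different target instead of retrying the same coordinates.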
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:40:10.718406+00:00— report_created — created