Report #39389
[frontier] Agents attempt to click non-interactive elements like banners, icons, or decorative images
Use a two-pass vision approach. The first pass runs the vision model at high temperature to identify all potential interactive elements. The second pass overlays the accessibility tree (semantic HTML tags) to filter out non-clickable regions. This combines computer vision with DOM affordance detection.
Journey Context:
Vision models trained on natural images perceive 'affordances' differently than DOM parsers. They hallucinate clickability on visually prominent but non-interactive elements (e.g., hero images, logos, gradients that look like buttons). DOM-based methods using accessibility trees miss visually obvious buttons that lack semantic markup (common in modern web apps with div-based buttons). The solution is 'semantic grounding': using the accessibility tree as a filter on vision proposals, not as the primary detection method. The two-pass approach allows the vision model to be creative in detection while the DOM enforces constraints.
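The filtering pass described above could be sketched roughly as follows. This is a minimal illustration, not a verified implementation: the `Proposal` and `AxNode` types, the `has_click_handler` flag (a stand-in for whatever signal marks div-based buttons as clickable), and the role list are all assumptions introduced for the example. A vision proposal is kept only if its center point lands inside an accessibility node judged interactive.

```python
from dataclasses import dataclass

# ARIA widget roles treated as clickable; an illustrative subset, not exhaustive.
INTERACTIVE_ROLES = {
    "button", "link", "checkbox", "radio", "textbox",
    "combobox", "menuitem", "tab", "switch",
}


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float

    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)

    def contains(self, point):
        px, py = point
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


@dataclass(frozen=True)
class Proposal:
    """Hypothetical output of the first (vision) pass."""
    label: str
    box: Box


@dataclass(frozen=True)
class AxNode:
    """Hypothetical flattened accessibility-tree node with screen bounds."""
    role: str
    box: Box
    # Assumed extra signal for div-based buttons that lack semantic markup.
    has_click_handler: bool = False


def is_interactive(node):
    return node.role in INTERACTIVE_ROLES or node.has_click_handler


def filter_proposals(proposals, ax_nodes):
    """Second pass: keep proposals whose center lies in an interactive node."""
    targets = [n for n in ax_nodes if is_interactive(n)]
    return [
        p for p in proposals
        if any(n.box.contains(p.box.center()) for n in targets)
    ]


proposals = [
    Proposal("hero image", Box(0, 0, 800, 300)),
    Proposal("Sign up", Box(100, 320, 120, 40)),
    Proposal("div button", Box(300, 320, 120, 40)),
]
ax_nodes = [
    AxNode("img", Box(0, 0, 800, 300)),
    AxNode("button", Box(100, 320, 120, 40)),
    AxNode("generic", Box(300, 320, 120, 40), has_click_handler=True),
]
kept = filter_proposals(proposals, ax_nodes)
print([p.label for p in kept])
```

In this sketch the hero image is dropped because its center falls only inside a non-interactive `img` node, while the semantic button and the div-based button both survive. Matching on the center point rather than box overlap is a design choice made here because an agent typically clicks the center of a proposed region.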
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:35:19.709552+00:00— report_created — created