Report #86945
[frontier] Vision-language models hallucinate clickable UI elements: controls that appear interactive in static screenshots but are actually disabled, in a loading state, or purely decorative
Implement active cursor probing: before executing a click at coordinates predicted by the VLM, move the cursor to that location and capture a new screenshot to verify that the cursor changes to a pointer/hand state. Alternatively, run a lightweight JavaScript check that the target element is not disabled (element.disabled !== true, since non-form elements such as links report undefined rather than false) and that its computed pointer-events value is not none.
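A minimal sketch of the JavaScript side of the probe, assuming the agent can evaluate script in the page. The names isLikelyClickable and probePoint are hypothetical and not part of the report; only document.elementFromPoint and window.getComputedStyle are standard browser APIs. The style lookup is passed in as a parameter so the check can be exercised outside a browser.

```javascript
// Hypothetical helper, not taken from the report.
// `getStyle` is any function returning a computed-style-like object;
// in a browser, pass (e) => window.getComputedStyle(e).
function isLikelyClickable(el, getStyle) {
  if (!el) return { ok: false, reason: "no element at point" };

  // Walk up the tree: a <span> inside a disabled <button> is not clickable either.
  for (let node = el; node; node = node.parentElement) {
    if (node.disabled === true) return { ok: false, reason: "disabled" };
    const aria = node.getAttribute ? node.getAttribute("aria-disabled") : null;
    if (aria === "true") return { ok: false, reason: "aria-disabled" };
  }

  const style = getStyle(el);
  if (style.pointerEvents === "none") {
    return { ok: false, reason: "pointer-events: none" };
  }
  if (style.display === "none" || style.visibility === "hidden") {
    return { ok: false, reason: "not visible" };
  }

  // The cursor value is a hint only; some clickable elements never set cursor: pointer.
  return { ok: true, reason: "passed checks", cursor: style.cursor };
}

// Browser-side entry point; (x, y) are viewport coordinates in CSS pixels.
function probePoint(x, y) {
  const el = document.elementFromPoint(x, y);
  return isLikelyClickable(el, (e) => window.getComputedStyle(e));
}
```

Two caveats for this sketch. Screenshot coordinates may need scaling by the device pixel ratio before they are passed to probePoint. A decorative div is neither disabled nor hidden, so it will pass these checks; the returned cursor value can serve as an additional signal there.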
Journey Context:
VLMs trained on static web images learn the correlation between button-like shapes and clickability, but not dynamic state. In a screenshot, a 'Submit' button can look nearly identical whether it is enabled or disabled (grayed-out styling is often subtle), or when it sits behind a loading spinner. As a result, agents waste actions clicking decorative divs or elements that are still loading. DOM-aware verification acts as a reality check on the model's prediction. Alternative: training VLMs on video sequences of cursor interactions (emerging in 2025) would address this at the model level, but production models aren't there yet. For now, the probe pattern is the pragmatic fix used in reliable computer-use agents to prevent phantom clicks.
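The probe pattern can be wired into the agent loop roughly as follows. This is a sketch under assumptions: driver is a hypothetical object supplied by the agent harness, and its probe, click, and wait methods are illustrative names rather than any real library's API. The retry accounts for loading states that may clear after a short delay.

```javascript
// Sketch only. `driver` is hypothetical and assumed to provide:
//   probe(x, y) -> Promise<{ ok: boolean, reason: string }>
//   click(x, y) -> Promise<void>
//   wait(ms)    -> Promise<void>
async function guardedClick(driver, x, y, { retries = 2, delayMs = 500 } = {}) {
  let last = { ok: false, reason: "not probed" };
  for (let attempt = 0; attempt <= retries; attempt++) {
    last = await driver.probe(x, y);
    if (last.ok) {
      await driver.click(x, y);
      return { clicked: true, attempts: attempt + 1, reason: last.reason };
    }
    // A loading state may clear shortly; wait before probing again.
    if (attempt < retries) await driver.wait(delayMs);
  }
  // Report why the click was skipped so the planner can choose another target.
  return { clicked: false, attempts: retries + 1, reason: last.reason };
}
```

Returning the failure reason, instead of throwing, lets the caller feed it back to the VLM as context for its next action.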
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:31:29.882542+00:00— report_created — created