Report #71444
[frontier] Vision agents hallucinate UI element locations when parsing raw screenshots, clicking wrong coordinates or non-interactive regions
Pre-process screenshots by overlaying visual markers (numbered badges or colors) on interactive elements, using accessibility tree data to locate them, before sending the image to the VLM. Have the model reference markers in its action space rather than raw coordinates.
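A minimal sketch of the bookkeeping behind this workaround, assuming the accessibility tree has already been flattened into a list of nodes with a role, a name, and a pixel bounding box. The `Element` shape, the `INTERACTIVE_ROLES` set, and the function names are hypothetical, not taken from any particular framework. Drawing the badges onto the screenshot is left to whatever image library is in use; the sketch covers only mark assignment and turning a chosen mark back into a click point.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Element:
    """One accessibility-tree node (hypothetical shape, for illustration only)."""
    role: str
    name: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in screenshot pixels


# Assumed set of roles treated as interactive; adjust for the target platform.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def assign_marks(elements: List[Element]) -> Dict[int, Element]:
    """Number each interactive element with a non-empty box, in reading order."""
    interactive = [
        e for e in elements
        if e.role in INTERACTIVE_ROLES
        and e.bbox[2] > e.bbox[0]
        and e.bbox[3] > e.bbox[1]
    ]
    # Sort top-to-bottom, then left-to-right, so mark numbers follow the layout.
    interactive.sort(key=lambda e: (e.bbox[1], e.bbox[0]))
    return {i: e for i, e in enumerate(interactive, start=1)}


def click_point(marks: Dict[int, Element], mark_id: int) -> Tuple[int, int]:
    """Resolve a mark chosen by the model to the center of its bounding box."""
    if mark_id not in marks:
        raise KeyError(f"mark {mark_id} is not on this screenshot")
    left, top, right, bottom = marks[mark_id].bbox
    return ((left + right) // 2, (top + bottom) // 2)


elements = [
    Element("button", "Save", (100, 200, 180, 230)),
    Element("image", "Logo", (0, 0, 50, 50)),      # not interactive, gets no mark
    Element("link", "Home", (10, 10, 60, 30)),
]
marks = assign_marks(elements)
```

Because the click coordinates come from the accessibility data rather than from the model, the model only has to pick a number, which is the point of the workaround.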
Journey Context:
Raw screenshots cause coordinate drift because VLMs struggle with precise spatial reasoning on unmarked UIs. DOM-only approaches lose visual semantics. Set-of-Marks bridges both by grounding text references to visual markers, reducing hallucination by 40%+ in GUI tasks. The tradeoff is increased token count for the marked image, but accuracy gains outweigh cost.
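One way to enforce that grounding on the output side is to accept only actions that name a marker actually present on the screenshot. The sketch below assumes a `CLICK [n]` reply format. That format, `ACTION_RE`, and `parse_click` are illustrative choices, not part of the report.

```python
import re
from typing import Optional, Set

# Assumed action grammar: the model replies with exactly "CLICK [<mark number>]".
ACTION_RE = re.compile(r"^\s*CLICK\s*\[(\d+)\]\s*$", re.IGNORECASE)


def parse_click(reply: str, valid_marks: Set[int]) -> Optional[int]:
    """Return the mark id if the reply names a mark that was drawn, else None."""
    match = ACTION_RE.match(reply)
    if match is None:
        return None  # raw coordinates or free text: not a marker reference
    mark_id = int(match.group(1))
    return mark_id if mark_id in valid_marks else None


valid = {1, 2, 3}
accepted = parse_click("CLICK [3]", valid)
unknown_mark = parse_click("click [7]", valid)
raw_coords = parse_click("CLICK (412, 88)", valid)
```

A `None` result can be handled by re-prompting the model instead of clicking, so a hallucinated target never reaches the UI.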
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:29:42.145100+00:00— report_created — created