Report #84981
[frontier] Vision-language model clicks wrong UI element due to boundary ambiguity in dense interfaces
Pre-process screenshots with set-of-marks: overlay numbered markers on interactive elements using DOM bounding boxes or icon detection, then prompt the model to reference markers (e.g., 'click on [3]') rather than raw coordinates.
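A minimal sketch of the numbering and prompting step, assuming the bounding boxes have already been collected from the DOM or an icon detector. The `Box` type, `assign_marks`, `build_prompt`, and the prompt wording are all hypothetical names chosen for illustration, not part of any tool mentioned in this report. Drawing the markers onto the screenshot is left out because it depends on the imaging library in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one interactive element, in screenshot pixels."""
    left: int
    top: int
    width: int
    height: int
    label: str = ""  # assumption: accessible name or detector caption, may be empty


def assign_marks(boxes):
    """Number elements in reading order (top to bottom, then left to right).

    Returns a dict mapping marker id (starting at 1) to its Box.
    """
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {i: box for i, box in enumerate(ordered, start=1)}


def build_prompt(task, marks):
    """Build a text prompt that asks for a marker id instead of coordinates."""
    legend = "\n".join(f"[{i}] {box.label}".rstrip() for i, box in marks.items())
    return (
        f"Task: {task}\n"
        "The screenshot has numbered markers on interactive elements:\n"
        f"{legend}\n"
        "Answer with a marker, e.g. 'click on [3]'. Do not output coordinates."
    )


if __name__ == "__main__":
    marks = assign_marks([
        Box(100, 50, 40, 20, "Save"),
        Box(10, 50, 40, 20, "Open"),
        Box(10, 10, 40, 20, "File"),
    ])
    print(build_prompt("save the file", marks))
```

Sorting by reading order is only one possible numbering scheme. Any stable ordering works as long as the same mapping is used when the model's answer is resolved.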
Journey Context:
Raw coordinate prediction suffers from small-target ambiguity (buttons <50px) and resolution variance. Set-of-marks decouples recognition from localization: the vision model identifies what to click, and the marker ID maps to a coordinate. This is the pattern behind Microsoft OmniParser and reportedly OpenAI CUA's grounding strategy. Tradeoff: it requires an element detection pass, adding 100-300ms latency, but reduces misclick rate by 60-80% on dense UIs.
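The localization half of that decoupling can be sketched as follows: the model's reply carries only a marker ID, and the click point is computed from the stored bounding box. The function name `resolve_click`, the `(left, top, width, height)` tuple layout, and the choice to click the box center are assumptions made for this example.

```python
import re

# Matches a marker reference such as "[3]" in the model's reply.
MARK_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_click(model_reply, marks):
    """Map a reply like 'click on [3]' to a pixel coordinate.

    marks: dict of marker id -> (left, top, width, height) in screenshot pixels.
    Returns the (x, y) center of the referenced box.
    Raises ValueError if the reply has no marker or the marker is unknown.
    """
    match = MARK_PATTERN.search(model_reply)
    if match is None:
        raise ValueError(f"no marker reference in reply: {model_reply!r}")
    mark_id = int(match.group(1))
    if mark_id not in marks:
        raise ValueError(f"unknown marker id: {mark_id}")
    left, top, width, height = marks[mark_id]
    return (left + width // 2, top + height // 2)


if __name__ == "__main__":
    marks = {3: (100, 50, 40, 20)}
    print(resolve_click("click on [3]", marks))
```

Rejecting unknown or missing marker IDs, rather than guessing a coordinate, keeps a malformed reply from turning into a misclick.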
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:13:48.187915+00:00 - report_created - created