Report #57895
[frontier] Vision agents hallucinate coordinates when clicking small UI elements or icons without text labels
Pre-process screenshots to overlay numbered markers (Set-of-Mark) on interactive elements before sending them to the VLM, then parse the marker number from its response rather than raw coordinates
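A minimal sketch of the marker-resolution half of this workaround, assuming element bounding boxes have already been detected and the numbered labels already drawn on the screenshot (drawing is omitted here). The `Box` type, the helper names, and the `MARKER: <n>` reply format are hypothetical choices for illustration, not part of any library or of the original report.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Element bounding box in screenshot pixels (hypothetical type)."""
    left: int
    top: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        return (self.left + self.width // 2, self.top + self.height // 2)


def assign_markers(boxes: List[Box]) -> Dict[int, Box]:
    """Number the detected elements 1..N; these are the labels drawn on the image."""
    return {i: box for i, box in enumerate(boxes, start=1)}


# Assumed reply format: the prompt asks the VLM to answer "MARKER: <number>".
_MARKER_RE = re.compile(r"MARKER:\s*(\d+)")


def click_point(reply: str, markers: Dict[int, Box]) -> Optional[Tuple[int, int]]:
    """Map the VLM reply to a click point, or None if no valid marker is named."""
    match = _MARKER_RE.search(reply)
    if match is None:
        return None  # reply did not follow the expected format
    box = markers.get(int(match.group(1)))
    # Unknown marker numbers are rejected instead of being turned into a click.
    return box.center() if box is not None else None


markers = assign_markers([Box(10, 20, 30, 40), Box(100, 100, 16, 16)])
print(click_point("MARKER: 2", markers))
print(click_point("MARKER: 9", markers))
```

Because the click target comes from a lookup table rather than from the model's text, a reply that names a missing marker or falls back to raw coordinates yields `None` and can be retried.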
Journey Context:
Raw pixel coordinates fail when viewport scaling, retina display multipliers (2x/3x), or CSS transforms (scale, rotate, translate) are applied: what the DOM reports as (100, 100) may render at (200, 200) in screenshot pixels. OCR-based localization misses icons and unlabeled graphical buttons entirely. The Set-of-Mark (SoM) prompting pattern from Microsoft Research has the VLM ground its choice in visible numeric labels rather than estimate coordinates, so a reply that names no valid marker can be rejected instead of becoming a click on a non-existent element. Tradeoff: it requires a fast local inference step to generate the marked image (often a lightweight detection model such as OmniParser), but it reportedly reduces VLM token consumption and error rates by 40-60% on complex UIs compared with coordinate prediction.
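The coordinate mismatch described above can be illustrated with a small conversion sketch. It assumes a single uniform device pixel ratio and no CSS transforms on the element, so it covers only the retina-multiplier case; the function names are hypothetical.

```python
from typing import Tuple


def css_to_screenshot(x_css: float, y_css: float,
                      device_pixel_ratio: float) -> Tuple[int, int]:
    """Convert DOM (CSS pixel) coordinates to screenshot pixel coordinates."""
    if device_pixel_ratio <= 0:
        raise ValueError("device_pixel_ratio must be positive")
    return (round(x_css * device_pixel_ratio), round(y_css * device_pixel_ratio))


def screenshot_to_css(x_px: float, y_px: float,
                      device_pixel_ratio: float) -> Tuple[float, float]:
    """Convert screenshot pixel coordinates back to CSS pixels for a click API."""
    if device_pixel_ratio <= 0:
        raise ValueError("device_pixel_ratio must be positive")
    return (x_px / device_pixel_ratio, y_px / device_pixel_ratio)


# On a 2x display, the DOM position (100, 100) lands at (200, 200) in the image.
print(css_to_screenshot(100, 100, 2.0))
print(screenshot_to_css(200, 200, 2.0))
```

With marker-based selection this conversion is applied once to the detected bounding boxes, rather than relying on the VLM to guess which coordinate space it is answering in.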
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:40:04.183491+00:00 - report_created - created