Report #57479
[frontier] Agent clicks wrong coordinates after analyzing screenshot due to aspect ratio distortion
Use Set-of-Mark (SoM) visual grounding with numbered labels overlaid on UI elements; the agent outputs element numbers instead of raw (x,y) coordinates
Journey Context:
Agents that predict raw (x,y) coordinates are exposed to three sources of error: aspect ratio distortion between training and inference viewports, high-DPI coordinate mapping errors (CSS pixels vs physical pixels), and dynamic viewport resizing. The 'Computer Use' beta and similar systems show agents missing click targets by 50-100 pixels when the browser window differs from the training distribution. The robust pattern is 'visual grounding via enumeration': overlay numbered markers on UI elements in the screenshot (Set-of-Mark) and force the agent to output the reference number rather than coordinates. The harness, which already knows where each marked element sits, resolves the number to a click, so the model no longer performs any coordinate transformation and the action space becomes discrete and viewport-agnostic.
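A minimal sketch of the harness side of this pattern, assuming the element bounding boxes are already available in CSS pixels (for example from the DOM or an accessibility tree). The names `Element`, `assign_marks`, and `resolve_mark` are hypothetical and not taken from any particular library. Whether the click call expects CSS or physical pixels depends on the automation tool, so the `device_pixel_ratio` parameter is only there to make that conversion explicit. Drawing the numbered labels onto the screenshot is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A clickable UI element with its bounding box in CSS pixels (hypothetical type)."""
    name: str
    x: float
    y: float
    width: float
    height: float


def assign_marks(elements):
    """Give each element a 1-based mark number, in the order supplied."""
    return {i: el for i, el in enumerate(elements, start=1)}


def resolve_mark(marks, mark, device_pixel_ratio=1.0):
    """Map the mark number chosen by the agent to the centre of that element.

    The result is scaled by device_pixel_ratio; pass 1.0 if the click API
    takes CSS pixels (an assumption to check against the tool in use).
    """
    if mark not in marks:
        # Reject numbers the agent invented instead of clicking somewhere arbitrary.
        raise KeyError(f"unknown mark {mark}; valid marks are {sorted(marks)}")
    el = marks[mark]
    cx = (el.x + el.width / 2) * device_pixel_ratio
    cy = (el.y + el.height / 2) * device_pixel_ratio
    return round(cx), round(cy)


marks = assign_marks([
    Element("search box", x=100, y=20, width=300, height=40),
    Element("submit button", x=420, y=20, width=80, height=40),
])

# The agent answers "2"; the harness turns that into a click point.
print(resolve_mark(marks, 2))                          # (460, 40)
print(resolve_mark(marks, 2, device_pixel_ratio=2.0))  # (920, 80)
```

Because the agent only ever emits an integer, an out-of-range answer can be rejected outright, which is not possible with a coordinate that is off by a few dozen pixels.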
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:57:59.208311+00:00— report_created — created