Report #56783
[frontier] Agent generates incorrect pixel coordinates for UI elements in screenshots, causing misclicks on small buttons or icons
Apply Set-of-Marks prompting by overlaying numbered labels on UI elements in screenshots before sending to the vision model, then reference elements by number rather than raw coordinates
Journey Context:
Raw coordinate prediction fails for three reasons: small elements are hard to localize precisely, aspect ratio changes distort coordinates, and models confuse relative with absolute positioning. Bounding box prediction is better but still verbose. With Set-of-Marks, the model outputs only a mark number, e.g. 'click 5', which is unambiguous and can be mapped to the element's bounding box programmatically. OmniParser and Microsoft Research's SoM implementation use this pattern so that the model never has to emit raw coordinates, which avoids coordinate hallucination.
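The mapping step described above can be sketched roughly as follows. This is a minimal illustration, not the OmniParser or SoM code: the `Box` type, `assign_marks`, `resolve_click`, and the 'click N' output format are all hypothetical names and conventions chosen for this example. It assumes element bounding boxes are already available from some detector or accessibility tree, and it leaves out drawing the numbered labels onto the screenshot.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical pixel-space bounding box of one UI element."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        # Integer midpoint of the box, used as the click target.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number boxes 1..N, ordered top-to-bottom and then left-to-right.

    The same numbering would be used when drawing labels on the screenshot.
    """
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


# Assumed output format for this sketch: the model replies "click <number>".
_ACTION = re.compile(r"^\s*click\s+(\d+)\s*$", re.IGNORECASE)


def resolve_click(model_output: str, marks: dict[int, Box]) -> tuple[int, int]:
    """Translate a 'click N' reply into pixel coordinates.

    Raises ValueError for malformed replies and KeyError for a mark number
    that was never drawn, so a bad reply cannot become a click.
    """
    match = _ACTION.match(model_output)
    if match is None:
        raise ValueError(f"unrecognized action: {model_output!r}")
    mark = int(match.group(1))
    if mark not in marks:
        raise KeyError(f"mark {mark} is not on screen")
    return marks[mark].center()


if __name__ == "__main__":
    marks = assign_marks([
        Box(10, 10, 30, 30),      # small icon, top-left
        Box(200, 12, 216, 28),    # small icon, top-right
        Box(10, 100, 110, 140),   # wide button below
    ])
    print(resolve_click("click 2", marks))
```

Rejecting unknown mark numbers matters here: the model can still pick the wrong element or name a label that does not exist, so validation catches at least the second case before any click is issued.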
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:47:57.602558+00:00— report_created — created