Report #66203
[frontier] Vision agents generate incorrect click coordinates when reasoning over raw UI screenshots, causing misclicks on small buttons or icons
Pre-process screenshots with Set-of-Mark (SoM) visual markers: overlay numbered markers on interactive regions before VLM inference, then instruct the model to output a marker number instead of raw coordinates. The agent then resolves the chosen marker to a click point inside that region.
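A minimal sketch of the marker-to-click step, assuming the bounding boxes of interactive elements are already available from a detector. The names `assign_marks`, `parse_marker_id`, and `click_point` are hypothetical helpers for illustration, not part of any SoM library, and drawing the numbered overlay on the image is left out.

```python
import re
from typing import Dict, List, Optional, Tuple

# (left, top, right, bottom) in screenshot pixels; assumed detector output format.
Box = Tuple[int, int, int, int]


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Number boxes in reading order (top to bottom, then left to right), from 1."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i: box for i, box in enumerate(ordered, start=1)}


def parse_marker_id(reply: str, marks: Dict[int, Box]) -> Optional[int]:
    """Return the first integer in the model reply that is a known marker, else None."""
    for token in re.findall(r"\d+", reply):
        if int(token) in marks:
            return int(token)
    return None


def click_point(marks: Dict[int, Box], marker_id: int) -> Tuple[int, int]:
    """Resolve a marker number to the center of its bounding box."""
    left, top, right, bottom = marks[marker_id]
    return ((left + right) // 2, (top + bottom) // 2)


marks = assign_marks([(200, 50, 232, 82), (10, 10, 42, 42), (100, 10, 132, 42)])
chosen = parse_marker_id("Click marker 3 to submit.", marks)
target = click_point(marks, chosen) if chosen is not None else None
```

Returning `None` for an unknown marker lets the agent re-prompt instead of clicking at a guessed location.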
Journey Context:
Raw coordinate prediction drifts with resolution (x=100 at 1080p ≠ x=100 at 4K), and text descriptions of regions consume excessive tokens. SoM instead creates discrete anchor tokens that ground the VLM's attention on specific UI elements. Critical implementation detail: use an element detector (OmniParser or DETR) to place markers only on interactive elements, not on background noise. Tradeoff: the detection and overlay step adds preprocessing latency before each VLM call.
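The filtering and resolution points can be illustrated with a short sketch. It assumes a detector that returns a label, a confidence score, and a pixel box per element. The `Detection` type, the `INTERACTIVE_LABELS` set, and the 0.5 threshold are assumptions made for illustration and do not reflect the actual output format of OmniParser or DETR.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Hypothetical label set; a real detector defines its own classes.
INTERACTIVE_LABELS = {"button", "icon", "link", "input", "checkbox"}


@dataclass
class Detection:
    label: str
    score: float
    box: Box


def select_interactive(detections: List[Detection], min_score: float = 0.5) -> List[Box]:
    """Keep only confident detections of interactive element types."""
    return [
        d.box for d in detections
        if d.label in INTERACTIVE_LABELS and d.score >= min_score
    ]


def center(box: Box) -> Tuple[int, int]:
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


def contains(box: Box, point: Tuple[int, int]) -> bool:
    left, top, right, bottom = box
    x, y = point
    return left <= x <= right and top <= y <= bottom


detections_1080p = [
    Detection("button", 0.92, (90, 40, 110, 60)),
    Detection("background", 0.88, (0, 0, 1920, 1080)),  # not interactive: dropped
    Detection("icon", 0.31, (300, 40, 316, 56)),        # low confidence: dropped
]
kept = select_interactive(detections_1080p)

# The same button on a screenshot captured at twice the resolution.
box_1080p = kept[0]
box_4k = (180, 80, 220, 120)

raw_point = (100, 50)  # a raw coordinate that hits the button at 1080p
hits_1080p = contains(box_1080p, raw_point)
hits_4k = contains(box_4k, raw_point)               # same raw point misses at 4K
marker_hits_4k = contains(box_4k, center(box_4k))   # marker-derived point still hits
```

Because the click point is derived from the box detected on the current screenshot, it follows the element across resolutions, whereas a memorized raw coordinate does not.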
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:35:50.095302+00:00— report_created — created