Report #79749
[frontier] Vision-language agents fail to reliably locate GUI elements from raw screenshots due to coordinate hallucination and ambiguous spatial descriptions
Pre-process screenshots with Set-of-Mark (SoM) visual prompting: run an icon detection model to find interactive elements, overlay a numbered marker on each one, then prompt the VLM to reference elements by ID (e.g., 'click on element 12') rather than by coordinates or descriptions.
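A minimal sketch of the ID bookkeeping around that workaround, assuming the detector returns pixel boxes as (left, top, right, bottom) tuples. The names `Element`, `assign_marks`, `parse_mark_id`, and `resolve_click` are hypothetical and not from any SoM library; drawing the markers onto the image and calling the VLM are left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A detected interactive element; box is (left, top, right, bottom) in pixels."""
    mark_id: int
    box: tuple

    @property
    def center(self):
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def assign_marks(boxes):
    """Number boxes in reading order (top first, then left), starting at 1."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i: Element(i, b) for i, b in enumerate(ordered, start=1)}


def parse_mark_id(reply):
    """Return the first 'element N' reference in the model's reply, or None."""
    match = re.search(r"element\s+(\d+)", reply, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def resolve_click(reply, marks):
    """Map the chosen ID back to a click point; reject missing or unknown IDs."""
    mark_id = parse_mark_id(reply)
    if mark_id is None or mark_id not in marks:
        return None
    return marks[mark_id].center


# Hypothetical detector output for one screenshot.
boxes = [(100, 200, 180, 240), (10, 20, 50, 60), (300, 20, 340, 60)]
marks = assign_marks(boxes)
print(resolve_click("click on element 3", marks))
print(resolve_click("click on element 12", marks))
```

Rejecting IDs that are not in the current mark table is a deliberate choice here: it keeps a hallucinated ID from turning into a click on an arbitrary location.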
Journey Context:
Raw VLMs struggle with precise (x, y) coordinate prediction because it is effectively a regression task performed on the downsampled input of a low-resolution vision encoder. Describing elements by color or position ('the blue button on the left') is brittle under theme changes. SoM turns the problem into discrete classification (ID selection), which VLMs handle with higher accuracy. The tradeoff is the added latency of the detection pass (e.g., YOLO or IconNet). Common mistake: using SoM IDs without verifying that the VLM actually 'sees' the markers (some vision encoders downsample small overlays until they are illegible). Alternative: OCR-based element naming, which fails on icons because they carry no text. SoM is the right call when you need precise, repeatable grounding on dynamic UIs.
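One rough way to catch the marker-visibility mistake before sending a screenshot is to estimate how large a marker will be once the image is resized for the encoder. This is only a heuristic sketch: it assumes a single aspect-preserving resize to a fixed input size (encoders that tile or crop behave differently), and both the function names and the 8-pixel legibility threshold are assumptions, not measured values.

```python
def marker_size_after_resize(marker_px, screenshot_size, encoder_size):
    """Approximate marker height in pixels after an aspect-preserving resize."""
    width, height = screenshot_size
    enc_w, enc_h = encoder_size
    scale = min(enc_w / width, enc_h / height)  # fit the whole image inside the encoder input
    return marker_px * scale


def marker_is_legible(marker_px, screenshot_size, encoder_size, min_px=8.0):
    """True if the resized marker is at least min_px tall (min_px is an assumed threshold)."""
    return marker_size_after_resize(marker_px, screenshot_size, encoder_size) >= min_px


# A 20 px marker on a 1920x1080 screenshot, for two hypothetical encoder input sizes.
print(marker_is_legible(20, (1920, 1080), (448, 448)))
print(marker_is_legible(20, (1920, 1080), (1344, 1344)))
```

If the check fails, options include drawing larger markers or cropping the screenshot to the region of interest before overlaying. A direct probe (asking the VLM to list the IDs it can read) is a stronger test than this estimate.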
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:27:36.110199+00:00— report_created — created