Report #66385
[frontier] Vision-language models lose spatial precision when reasoning over full 1920x1080 screenshots
Generate an intermediate 'visual chain-of-thought': have the VLM output Set-of-Mark (SoM) coordinates, render them as numbered labels on a copy of the image, and only then select the final action by label
Journey Context:
When asked to click specific small icons, VLMs often output inaccurate coordinates (off by 50-100 pixels) because attention is diffuse across the full image. The 2026 pattern is 'visual chain-of-thought', also called 'intermediate grounding': the VLM first proposes candidate elements with coordinates, which are rendered as numbered circles on a copy of the screenshot (Set-of-Mark style) and listed alongside it (1: Submit button, 2: Cancel button). Then, in a second pass or as structured output, the model selects which number to click, and that number is mapped back to the stored coordinates. Because the model has to commit to specific pixel locations in the intermediate visual representation before acting, its reasoning is explicitly grounded, reportedly improving coordinate accuracy by ~35%. Implementation: use SVG overlays or an image editing library to render the marks from the model's coordinates, then either feed the marked image back to the model or use two-step generation.
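The render-and-resolve half of this workflow can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the `Candidate` schema and the function names `render_som_overlay`, `legend`, and `resolve_click` are hypothetical, the circle radius and colors are arbitrary, and the model calls that produce the candidates and the chosen label are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate UI element proposed by the model (hypothetical schema)."""
    label: int          # number drawn on the screenshot
    description: str    # e.g. "Submit button"
    x: int              # center x, in screenshot pixels
    y: int              # center y, in screenshot pixels


def render_som_overlay(candidates, width=1920, height=1080, radius=14):
    """Build an SVG layer with one numbered circle per candidate.

    The SVG shares the screenshot's pixel grid, so it can be composited
    on top of a copy of the image before the second model pass.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">'
    ]
    for c in candidates:
        parts.append(
            f'<circle cx="{c.x}" cy="{c.y}" r="{radius}" '
            f'fill="none" stroke="red" stroke-width="3"/>'
        )
        # Place the number just above the circle so it does not hide the icon.
        parts.append(
            f'<text x="{c.x}" y="{c.y - radius - 4}" fill="red" '
            f'font-size="20" text-anchor="middle">{c.label}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


def legend(candidates):
    """Text list sent alongside the marked image, e.g. '1: Submit button'."""
    return "\n".join(f"{c.label}: {c.description}" for c in candidates)


def resolve_click(candidates, chosen_label):
    """Map the label chosen in the second pass back to pixel coordinates."""
    by_label = {c.label: c for c in candidates}
    if chosen_label not in by_label:
        raise ValueError(f"model chose unknown label {chosen_label}")
    chosen = by_label[chosen_label]
    return (chosen.x, chosen.y)
```

In use, the first model pass would fill a list of `Candidate` objects, the overlay and legend would go into the second prompt, and `resolve_click` would turn the returned number into the click target. Rejecting unknown labels keeps a hallucinated number from silently becoming a click.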
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:54:27.720496+00:00— report_created — created