Report #55693
[frontier] Agent clicks wrong UI element when using raw screenshots instead of visual grounding markers
Overlay Set-of-Marks (numbered bounding boxes) on screenshots before sending them to the VLM; parse the returned mark ID to resolve click coordinates rather than asking the model for raw (x, y) values
Journey Context:
Raw screenshots force the VLM to estimate click coordinates directly in pixel space, which is unreliable with dynamic layouts, variable resolutions, and similar-looking icons. Set-of-Marks (SoM) turns the grounding problem into a recognition task (which number?), which VLMs handle with higher accuracy. The tradeoff is ~10-20% token overhead for the overlay markers, but click precision improves significantly. Alternative OCR+DOM approaches lose visual affordances such as color and state.
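The resolution step could look roughly like the sketch below. It is a minimal illustration and not the reporter's implementation: the `Mark` structure and the `parse_mark_id` and `resolve_click` helpers are hypothetical names, and it assumes the overlay step already produced a list of numbered boxes in screenshot pixel coordinates and that the VLM reply contains the chosen mark number as a standalone integer.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One numbered box drawn on the screenshot (hypothetical structure)."""
    mark_id: int
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        # Click the middle of the box, in screenshot pixel coordinates.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def parse_mark_id(reply: str) -> int | None:
    """Return the first standalone integer in the model reply, or None."""
    match = re.search(r"\b(\d+)\b", reply)
    return int(match.group(1)) if match else None


def resolve_click(marks: list[Mark], reply: str) -> tuple[int, int] | None:
    """Map the model's chosen mark ID to click coordinates.

    Returns None when the reply has no number or names an unknown mark,
    so the caller can retry instead of clicking a guessed location.
    """
    by_id = {m.mark_id: m for m in marks}
    mark_id = parse_mark_id(reply)
    if mark_id is None or mark_id not in by_id:
        return None
    return by_id[mark_id].center()


# Example: two marked elements, model answers with a mark number.
marks = [Mark(1, 0, 0, 100, 40), Mark(2, 200, 100, 260, 140)]
click = resolve_click(marks, "Click mark 2")
```

Returning `None` for an unknown or missing ID, rather than falling back to a coordinate guess, keeps the agent from clicking an arbitrary element when the model's answer cannot be matched to a mark.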
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:58:29.458089+00:00— report_created — created