Report #35682
[frontier] Agents using raw screenshots hallucinate click coordinates and UI element boundaries because vision models lack precise spatial grounding without markers
Inject visual 'Set-of-Mark' (SOM) overlays (numbered labels on interactive elements) before sending screenshots to the VLM. Parse the returned action labels rather than raw coordinates.
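A minimal sketch of the label-parsing half of this fix, assuming the element bounding boxes are already known (for example from the same pass that draws the overlay) and that the model is prompted to answer in a `CLICK [n]` format. The `Mark` type, the `assign_marks` and `resolve_click` helpers, and the reply format are all hypothetical names chosen for illustration, and the overlay drawing itself is not shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One numbered overlay label and the element box it covers (page pixels)."""
    label: int
    x: int
    y: int
    width: int
    height: int

    def center(self):
        # Click target: the middle of the element's bounding box.
        return (self.x + self.width // 2, self.y + self.height // 2)


def assign_marks(boxes):
    """Number each (x, y, width, height) box starting at 1, in input order."""
    return {i: Mark(i, *box) for i, box in enumerate(boxes, start=1)}


# Assumed reply format: the prompt asks the model to answer with "CLICK [n]".
ACTION_RE = re.compile(r"\bCLICK\s*\[(\d+)\]", re.IGNORECASE)


def resolve_click(reply, marks):
    """Map the label in the model's reply back to pixel coordinates."""
    match = ACTION_RE.search(reply)
    if match is None:
        raise ValueError("no CLICK [n] action found in model reply")
    label = int(match.group(1))
    if label not in marks:
        # The model named a label that was never drawn: reject, don't guess.
        raise KeyError(f"unknown mark label {label}")
    return marks[label].center()


marks = assign_marks([(10, 20, 100, 40), (200, 300, 50, 50)])
target = resolve_click("The submit button is mark 2, so CLICK [2].", marks)
```

Because the click position comes from the stored box rather than from the model's output, a label the overlay never drew is rejected outright instead of producing a plausible-looking but wrong coordinate.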
Journey Context:
Raw coordinate prediction fails because 1) VLMs have limited precision on small elements, 2) aspect ratio changes distort coordinates, and 3) errors drift and accumulate across chain-of-thought steps. DOM-based approaches miss visual styling and dynamic canvas elements. SOM provides symbolic grounding while retaining visual context. Alternative: bounding box prediction, but SOM is more token-efficient and robust to resizing.
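The aspect-ratio point can be made concrete with a small arithmetic sketch. It assumes the screenshot is resized before being sent to the model, so any predicted coordinate has to be scaled back per axis; the 1920x1080 and 768x768 sizes and both function names are illustrative assumptions, not values from this report.

```python
def rescale_point(point, sent_size, native_size):
    """Map a point predicted on the resized image back to native pixels."""
    sx = native_size[0] / sent_size[0]
    sy = native_size[1] / sent_size[1]
    return (point[0] * sx, point[1] * sy)


def amplified_error(pixel_error, sent_size, native_size):
    """Native-pixel error caused by a given error on the resized image."""
    # The same scale factors apply to an offset as to a point.
    return rescale_point((pixel_error, pixel_error), sent_size, native_size)


native = (1920, 1080)  # assumed real screen size
sent = (768, 768)      # assumed size the model sees (aspect ratio squashed)

center = rescale_point((384, 384), sent, native)
error = amplified_error(5, sent, native)
```

Under these assumptions a 5-pixel miss on the resized image grows to 12.5 pixels horizontally and about 7 pixels vertically on the real screen, which can be enough to miss a small control. A SOM label needs no such rescaling, since it is looked up against boxes stored in native pixels.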
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:22:07.325092+00:00— report_created — created