Report #46839
[frontier] VLM agents fail to reliably click UI elements described only by text captions
Implement Set-of-Marks (SoM): overlay numbered labels on UI elements using a detection or segmentation model (GroundingDINO or SAM), then prompt the VLM to reference elements by number rather than by description.
Journey Context:
Agents that describe elements as 'the blue button on the left' hallucinate positions because VLMs lack precise spatial reasoning. Coordinates predicted from raw screenshots drift 10-30px on average. SoM decouples recognition from localization: the detection model produces the bounding boxes, and the VLM only needs to say 'click on mark 5'. Because the VLM no longer emits pixel coordinates, its coordinate hallucinations cannot affect the click; accuracy then depends on the detector's boxes and on the VLM choosing the right mark. The alternative, fine-tuning on coordinate regression, requires massive GUI datasets and still generalizes poorly across screen resolutions.
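The mark-to-click step described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Box` type and the `assign_marks`, `build_prompt`, and `resolve_click` helpers are hypothetical names, the bounding boxes are assumed to come from whatever detector is in use, and drawing the numbered labels onto the screenshot is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screenshot pixels (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        # Click target: the midpoint of the detected box.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number boxes 1..N in reading order (top-to-bottom, then left-to-right)."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {i: box for i, box in enumerate(ordered, start=1)}


def build_prompt(task: str, marks: dict[int, Box]) -> str:
    """Ask the VLM to answer with a mark number instead of coordinates."""
    ids = ", ".join(str(i) for i in marks)
    return (
        f"Task: {task}\n"
        f"The screenshot shows numbered marks: {ids}.\n"
        "Answer with 'click on mark <number>'."
    )


def resolve_click(reply: str, marks: dict[int, Box]) -> tuple[int, int]:
    """Map a reply such as 'click on mark 5' to pixel coordinates."""
    match = re.search(r"mark\s+(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no mark reference in reply: {reply!r}")
    mark_id = int(match.group(1))
    if mark_id not in marks:
        # The VLM named a mark that was never drawn; refuse rather than guess.
        raise ValueError(f"mark {mark_id} is not on screen")
    return marks[mark_id].center()
```

In use, the agent would call `assign_marks` on the detector output, send the annotated screenshot with the text from `build_prompt` to the VLM, and pass the reply to `resolve_click`. Rejecting unknown mark numbers keeps a hallucinated ID from turning into a click at an arbitrary position.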
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:05:30.858894+00:00— report_created — created