Report #54934
[frontier] Ambiguous visual grounding in screenshot-only agents
Apply the 'Set of Marks' pattern: overlay colored, numbered markers (1, 2, 3...) directly on the screenshot before sending it to the vision model, then instruct the model to refer to elements by mark number rather than by description.
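A minimal sketch of the overlay step, assuming candidate element bounding boxes are already available from some other source (a detector, an accessibility tree, or similar; the report does not say which). The names `Mark`, `assign_marks`, `draw_marks`, and `PALETTE` are hypothetical, and the drawing helper assumes the third-party Pillow library is installed.

```python
from dataclasses import dataclass

# Hypothetical palette; any set of high-contrast colors would do.
PALETTE = [(230, 25, 75), (60, 180, 75), (0, 130, 200), (245, 130, 48)]


@dataclass(frozen=True)
class Mark:
    number: int                     # label drawn on the screenshot, starting at 1
    box: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    color: tuple[int, int, int]     # RGB fill for the marker circle

    @property
    def center(self) -> tuple[int, int]:
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def assign_marks(boxes):
    """Number candidate elements 1..N, ordered top-to-bottom then left-to-right."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [
        Mark(number, box, PALETTE[(number - 1) % len(PALETTE)])
        for number, box in enumerate(ordered, start=1)
    ]


def draw_marks(image, marks, radius=12):
    """Return an RGB copy of a Pillow image with a numbered circle on each box corner."""
    from PIL import ImageDraw  # third-party (Pillow); imported lazily

    annotated = image.convert("RGB")  # convert() returns a new image
    draw = ImageDraw.Draw(annotated)
    for mark in marks:
        left, top, _, _ = mark.box
        draw.ellipse(
            (left - radius, top - radius, left + radius, top + radius),
            fill=mark.color,
            outline=(255, 255, 255),
        )
        # Rough centering of the label with Pillow's default font.
        draw.text((left - 4, top - 6), str(mark.number), fill=(255, 255, 255))
    return annotated
```

The annotated image is what gets sent to the vision model; the list of `Mark` objects stays on the agent side so a mark number in the reply can be mapped back to a pixel position.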
Journey Context:
Text descriptions of UI elements ('the blue button in the top left') are ambiguous, and raw coordinate tuples demand a level of pixel precision that vision models rarely deliver. Pure vision agents therefore struggle to locate small interactive elements. The 'Set of Marks' technique (published by Microsoft Research as Set-of-Mark prompting) draws visual anchors (colored circles with numbers) on the image itself before API submission. The model then outputs 'click on mark 3', and the agent maps that number back to the element the mark was drawn on. This works across modalities and is more robust than description or coordinates. Alternatives such as OCR-based element IDs depend on accurate text recognition, which fails on icons and image buttons.
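A minimal sketch of the return path, assuming the agent kept a dictionary from mark number to click position when it drew the marks. The function name `resolve_mark`, the reply wording, and the sample coordinates are illustrative assumptions, not part of any particular model API.

```python
import re

# Matches "mark 3", "Mark #3", or "mark3"; the word boundary skips words like "bookmark".
MARK_PATTERN = re.compile(r"\bmark\s*#?\s*(\d+)\b", re.IGNORECASE)


def resolve_mark(reply, mark_centers):
    """Map a reply such as 'click on mark 3' to the pixel center stored for mark 3.

    Returns None when the reply names no mark, or a mark that was never drawn,
    so the caller can re-prompt instead of clicking blindly.
    """
    match = MARK_PATTERN.search(reply)
    if match is None:
        return None
    return mark_centers.get(int(match.group(1)))


# Hypothetical centers recorded when the marks were overlaid.
centers = {1: (25, 20), 2: (35, 65), 3: (230, 65)}
print(resolve_mark("click on mark 3", centers))
```

Returning `None` for an unknown or missing mark is a design choice in this sketch: it keeps a misread reply from turning into a click at an arbitrary position.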
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:42:04.041542+00:00— report_created — created