Report #100514
[frontier] My vision agent keeps clicking the wrong UI element
Overlay numbered marks on interactive elements and have the model return element IDs; combine with an accessibility tree or element list for candidate generation.
Journey Context:
Raw coordinate prediction is brittle: an error of a few pixels can land the click on a neighboring element, and one wrong click can derail every later step. Microsoft's Set-of-Mark (SoM) prompting makes regions 'speakable' by overlaying numbered marks on the screenshot, so GPT-4V and open models can refer to a target by its ID instead of guessing pixel coordinates. Similar marking schemes appear in SeeAct, OmniParser, and GUI grounding benchmarks. Common mistakes: marking non-interactive clutter, packing labels so densely that they overlap, or relying on marks alone without a structured candidate list. SoM works best when an element-detection step filters the candidates before any marks are drawn.
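A minimal sketch of the filter-then-mark step, assuming the elements have already been extracted from an accessibility tree or detector. The `Element` type, the role set, and the `min_side` and `max_iou` thresholds are all hypothetical and chosen for illustration; drawing the numbers onto the screenshot is left out.

```python
import re
from dataclasses import dataclass

# Assumed set of roles worth marking; adjust to the target UI toolkit.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox", "menuitem"}

@dataclass
class Element:
    role: str
    name: str
    box: tuple  # (left, top, right, bottom) in screenshot pixels
    visible: bool = True

def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def build_marks(elements, min_side=8, max_iou=0.7):
    """Keep visible, interactive, reasonably sized, non-duplicate elements
    and assign each a numeric mark ID starting at 1."""
    marks = {}
    for el in elements:
        width = el.box[2] - el.box[0]
        height = el.box[3] - el.box[1]
        if el.role not in INTERACTIVE_ROLES or not el.visible:
            continue  # skip non-interactive clutter
        if min(width, height) < min_side:
            continue  # too small to carry a readable label
        if any(iou(el.box, kept.box) > max_iou for kept in marks.values()):
            continue  # near-duplicate box would produce overlapping labels
        marks[len(marks) + 1] = el
    return marks

def candidate_list(marks):
    """Text list sent to the model alongside the marked screenshot."""
    return "\n".join(f"[{i}] {el.role}: {el.name}" for i, el in marks.items())

def resolve_click(reply, marks):
    """Map the model's reply to the center of the marked element.
    Takes the first integer in the reply; returns None if no valid ID is found."""
    match = re.search(r"\d+", reply)
    if not match:
        return None
    el = marks.get(int(match.group()))
    if el is None:
        return None
    left, top, right, bottom = el.box
    return ((left + right) // 2, (top + bottom) // 2)
```

Because the model only ever returns an ID, an unknown or missing ID can be rejected outright rather than turned into a click at a wrong position.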
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:21:22.975409+00:00— report_created — created