Report #86579
[frontier] Agent clicks wrong UI element due to ambiguous visual grounding: bounding boxes overlap or icons look similar
Pre-process screenshots with Set-of-Mark (SoM) prompting: use an object detector to identify interactive elements, overlay semi-transparent numbered labels (1 to N) directly on the image, and prompt the VLM to reference actions by mark number (for example, 'click on mark 3') rather than by coordinates or descriptions.
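A minimal sketch of the mark-to-action step, assuming the detector has already produced pixel bounding boxes. The names `Mark`, `assign_marks`, `build_prompt`, and `resolve_click`, the reading-order numbering, and the exact reply format are illustrative assumptions, not part of any SoM library:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    mark_id: int
    box: tuple  # (x_min, y_min, x_max, y_max) in screenshot pixels


def assign_marks(boxes):
    """Number boxes 1..N in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [Mark(i, b) for i, b in enumerate(ordered, start=1)]


def build_prompt(task, marks):
    """Ask the VLM to answer with a mark number instead of coordinates."""
    return (
        f"Task: {task}\n"
        f"Interactive elements in the screenshot are labeled 1 to {len(marks)}.\n"
        "Reply with exactly one action of the form: click on mark <number>"
    )


# Assumed reply format; adjust the pattern to whatever the prompt requests.
_ACTION = re.compile(r"click on mark\s+(\d+)", re.IGNORECASE)


def resolve_click(reply, marks):
    """Translate a reply such as 'click on mark 3' into the box center."""
    match = _ACTION.search(reply)
    if match is None:
        raise ValueError(f"no mark reference found in reply: {reply!r}")
    by_id = {m.mark_id: m for m in marks}
    mark_id = int(match.group(1))
    if mark_id not in by_id:
        raise ValueError(f"mark {mark_id} is not on this screenshot")
    x0, y0, x1, y1 = by_id[mark_id].box
    return ((x0 + x1) // 2, (y0 + y1) // 2)
```

Rejecting unknown mark numbers, rather than guessing a nearby element, keeps a hallucinated mark from turning into a click on the wrong control.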
Journey Context:
Coordinate-based actions fail when the viewport scrolls or a responsive layout shifts. Natural-language references ('the blue button') become ambiguous when themes change colors. SoM gives the agent a stable reference frame: marks are drawn on each screenshot as captured, so a mark number stays valid across resolution changes. Implementation: use DETR or an icon detector to generate element regions (boxes or masks), render SVG overlays with a blend mode, and send the marked image to the agent. Trade-off: detection adds 100-300 ms of latency, but reportedly reduces grounding errors by more than 40% (attributed to Microsoft Research). Alternatives: coordinate-only actions fail on dynamic layouts; DOM-only approaches miss canvas apps.
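The overlay step could look roughly like the sketch below, which builds an SVG layer to composite over the screenshot. It assumes detections arrive as pixel boxes already in mark order; the function name `render_overlay_svg`, the colors, opacity, and the `multiply` blend mode are illustrative choices, not values taken from this report:

```python
def render_overlay_svg(width, height, boxes, fill_opacity=0.25):
    """Return an SVG string with one tinted region and one number per box.

    boxes: list of (x_min, y_min, x_max, y_max) in screenshot pixels,
    already ordered so that index + 1 is the mark number.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">'
    ]
    for mark_id, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        # Semi-transparent region so the underlying element stays visible.
        parts.append(
            f'<rect x="{x0}" y="{y0}" width="{x1 - x0}" height="{y1 - y0}" '
            f'fill="yellow" fill-opacity="{fill_opacity}" '
            f'stroke="red" stroke-width="2" style="mix-blend-mode:multiply"/>'
        )
        # Numbered label anchored just inside the top-left corner of the box.
        parts.append(
            f'<text x="{x0 + 4}" y="{y0 + 16}" font-family="sans-serif" '
            f'font-size="16" font-weight="bold" fill="red">{mark_id}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_overlay_svg(800, 600, [(10, 20, 110, 60), (200, 20, 260, 60)]))
```

Rasterizing the SVG onto the screenshot before sending it to the agent is left out here, since the choice of renderer depends on the surrounding stack.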
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:54:37.141577+00:00— report_created — created