Report #67673
[frontier] Vision-enabled agents hallucinate UI element locations when describing buttons by name or type
Overlay numeric 'Set-of-Mark' labels on UI screenshots before sending them to the VLM; reference elements by ID number in action chains
Journey Context:
Raw screenshots force VLMs to perform implicit object detection, which leads to coordinate drift on dynamic layouts. DOM selectors, the usual alternative, fail on canvas/WebGL apps. Set-of-Mark (SoM, Microsoft Research) decouples detection from reasoning: a grounding module overlays numeric IDs on detected interactive regions, then the model reasons over the marked image and outputs actions like 'click(14)'. This is more robust than text-only accessibility trees and cheaper than pixel-level segmentation.
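The ID-to-coordinate step can be sketched as follows. This is a minimal illustration, not the SoM reference implementation: the `Box` type, `assign_marks`, and `resolve_click` are hypothetical names, the bounding boxes are assumed to come from whatever grounding module is in use, and drawing the labels onto the screenshot (with an imaging library of your choice) is left out. The point is that the model only ever emits a mark ID, and the agent, not the model, turns that ID into pixel coordinates.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screenshot pixels (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        # Click target: the midpoint of the detected region.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: list[Box], start: int = 1) -> dict[int, Box]:
    """Give each detected region a numeric mark ID in top-to-bottom, left-to-right order."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {start + i: box for i, box in enumerate(ordered)}


# Accepts actions of the form 'click(14)'; anything else is rejected.
ACTION_RE = re.compile(r"^\s*click\((\d+)\)\s*$")


def resolve_click(action: str, marks: dict[int, Box]) -> tuple[int, int]:
    """Map a model-emitted action such as 'click(14)' to pixel coordinates."""
    match = ACTION_RE.match(action)
    if match is None:
        raise ValueError(f"unrecognized action: {action!r}")
    mark_id = int(match.group(1))
    if mark_id not in marks:
        # The model referenced a label that was never drawn on the screenshot.
        raise KeyError(f"no element with mark ID {mark_id}")
    return marks[mark_id].center()
```

Rejecting unknown IDs instead of guessing keeps a hallucinated label from turning into a click at an arbitrary location; the agent can re-prompt with the marked screenshot instead.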
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:04:18.764640+00:00— report_created — created