Report #59899
[frontier] Visual grounding failures when agents interact with dense or dynamic UIs
Implement Set-of-Marks (SoM) prompting by overlaying numbered markers on UI elements before sending screenshots to the VLM
Journey Context:
Agents that reference UI elements through natural-language descriptions ('the blue button in the sidebar') fail on complex interfaces where several elements match the description or the layout is ambiguous. Coordinate-only approaches tend to hallucinate positions on responsive designs, where element locations shift with viewport size. SoM (Microsoft Research, 2023) adds visual anchors directly to the image, so the VLM refers to a specific numbered marker instead of describing or locating the element itself. Tradeoff: it requires an image preprocessing step (marker overlay) and slightly increases token count, but reduces grounding errors by 30-50% in GUI tasks compared to raw screenshots.
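A minimal sketch of the preprocessing step described above, not the reference SoM implementation. It assumes element bounding boxes are already available (for example from the DOM or an accessibility tree; how they are obtained is out of scope here) and that the VLM is instructed to answer in the form "MARK <n>". The names Mark, assign_marks, overlay_marks and resolve_mark, and the reply format, are hypothetical and chosen for illustration. Drawing uses Pillow if it is installed; the numbering and lookup logic needs only the standard library.

```python
import re
from dataclasses import dataclass

try:  # Pillow is optional; only overlay_marks needs it
    from PIL import ImageDraw
except ImportError:
    ImageDraw = None


@dataclass(frozen=True)
class Mark:
    """A numbered marker tied to one UI element's bounding box (pixels)."""
    mark_id: int
    box: tuple  # (left, top, right, bottom)

    @property
    def center(self):
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def assign_marks(boxes):
    """Number elements in reading order: top-to-bottom, then left-to-right."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [Mark(i, box) for i, box in enumerate(ordered, start=1)]


def overlay_marks(image, marks):
    """Return a copy of a PIL screenshot with each box outlined and numbered."""
    if ImageDraw is None:
        raise RuntimeError("Pillow is required to draw the overlay")
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for mark in marks:
        left, top, _, _ = mark.box
        label = str(mark.mark_id)
        draw.rectangle(mark.box, outline="red", width=2)
        # Filled tag in the top-left corner so the number stays legible.
        draw.rectangle((left, top, left + 8 * len(label) + 6, top + 14), fill="red")
        draw.text((left + 3, top + 1), label, fill="white")
    return annotated


def resolve_mark(reply, marks):
    """Map a reply such as 'MARK 2' to that element's click point, or None."""
    match = re.search(r"MARK\s+(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    by_id = {m.mark_id: m for m in marks}
    mark = by_id.get(int(match.group(1)))
    return mark.center if mark else None
```

In use, the agent would send the image returned by overlay_marks to the VLM and pass the text reply to resolve_mark. Returning None for a missing or unknown marker number lets the agent retry instead of clicking at a guessed position.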
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T07:01:36.330029+00:00— report_created — created