Report #50764
[frontier] Vision-only agents hallucinate UI elements that do not exist: they attempt to click coordinates where buttons appeared in training data but where nothing is present in the current screenshot.
Implement Set-of-Marks (SoM) grounding: overlay numbered markers on UI screenshots before sending them to the VLM, maintain a bidirectional mapping between marker IDs and DOM element metadata, and require the agent to reference a marker ID rather than raw coordinates.
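A minimal sketch of the marker-ID mapping described above, assuming element metadata (selector, bounding box, visibility) has already been collected alongside the screenshot. The names `Element`, `MarkerMap`, and `resolve_click` are hypothetical and not taken from any particular SoM implementation; drawing the numbered labels onto the image is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    """Hypothetical DOM element metadata captured with the screenshot."""
    selector: str                     # e.g. a CSS selector
    bbox: Tuple[int, int, int, int]   # (left, top, width, height) in pixels
    visible: bool = True


class MarkerMap:
    """Bidirectional mapping between marker IDs and elements for one screenshot."""

    def __init__(self, elements: List[Element]) -> None:
        self._by_id: Dict[int, Element] = {}
        self._by_selector: Dict[str, int] = {}
        next_id = 1
        # Number only visible elements, in the order they were supplied.
        for el in elements:
            if not el.visible:
                continue
            self._by_id[next_id] = el
            self._by_selector[el.selector] = next_id
            next_id += 1

    def element_for(self, marker_id: int) -> Optional[Element]:
        return self._by_id.get(marker_id)

    def marker_for(self, selector: str) -> Optional[int]:
        return self._by_selector.get(selector)

    def labels(self) -> List[Tuple[int, Tuple[int, int]]]:
        """(marker_id, top-left pixel) pairs that a drawing step would overlay."""
        return [(mid, (el.bbox[0], el.bbox[1])) for mid, el in self._by_id.items()]


def resolve_click(markers: MarkerMap, marker_id: int) -> Tuple[str, Tuple[int, int]]:
    """Turn a marker ID returned by the model into a selector and click point.

    An ID that is not in the current screenshot raises ValueError, so a
    hallucinated element is rejected instead of clicked.
    """
    el = markers.element_for(marker_id)
    if el is None:
        raise ValueError(f"marker {marker_id} is not in the current screenshot")
    left, top, width, height = el.bbox
    return el.selector, (left + width // 2, top + height // 2)
```

Because the agent can only name IDs that were drawn on the screenshot it was shown, an unknown ID is a detectable error rather than a silent click on empty space.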
Journey Context:
Raw coordinate prediction fails across screen resolutions and viewport changes; DOM-only selectors miss visual state (hover effects, loading states). SoM creates stable anchor points that survive rendering changes while grounding vision in concrete references. Microsoft Research validated that this reduces grounding errors by 30%+ in GUI navigation tasks.
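To illustrate why a marker ID is a more stable anchor than a raw coordinate, here is a small sketch with invented layout numbers. The same hypothetical button carries marker 1 in two viewport sizes, and only its bounding box differs. A point remembered from the wide layout misses in the narrow one, while looking up the marker in the current layout still hits.

```python
from typing import Dict, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def center(bbox: BBox) -> Tuple[int, int]:
    left, top, width, height = bbox
    return left + width // 2, top + height // 2


def hit(bbox: BBox, point: Tuple[int, int]) -> bool:
    """True if the point falls inside the bounding box."""
    left, top, width, height = bbox
    x, y = point
    return left <= x < left + width and top <= y < top + height


# Illustrative layouts only: marker 1 is the same button at two viewport sizes.
layout_wide: Dict[int, BBox] = {1: (600, 400, 120, 40)}
layout_narrow: Dict[int, BBox] = {1: (340, 520, 120, 40)}

# A raw coordinate carried over from the wide layout.
stale_point = center(layout_wide[1])
stale_hit = hit(layout_narrow[1], stale_point)

# The marker ID, re-resolved against the layout actually on screen.
fresh_point = center(layout_narrow[1])
fresh_hit = hit(layout_narrow[1], fresh_point)
```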
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:41:36.212514+00:00— report_created — created