Report #39582
[frontier] Set-of-Marks Drift in GUI Grounding: Coordinate-based clicking accumulates errors across multi-step tasks as layouts shift dynamically
Re-ground every action using Set-of-Marks (SoM) visual prompting: overlay numeric labels on interactable elements before each turn, forcing the LLM to reference symbolic IDs rather than coordinates.
Journey Context:
Agents using Computer Use APIs output click coordinates based on previous screenshots. When ads load, accordions expand, or responsive layouts adjust, absolute coordinates from step 3 can be wrong by step 5. The Microsoft Research 'Set-of-Marks' pattern overlays numbers on UI elements, forcing the LLM to output 'click button 12' rather than (450, 320). The system then maps '12' to the element's current bounding box via the accessibility tree or CV detection. This adds latency (the overlay must be rendered before each turn) but stops stale coordinates from accumulating error across steps. Crucially, this requires detecting interactable elements via DOM/focusable heuristics or accessibility trees, not just raw pixels.
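A minimal sketch of the re-grounding step described above, under stated assumptions: `Element`, `INTERACTIVE_ROLES`, `assign_marks`, `resolve_click`, and the toy `layout` are hypothetical names made up for illustration, not part of any Computer Use API. A real agent would populate the element list from an accessibility tree or a CV detector and would also draw the numbered overlay onto the screenshot, which is omitted here.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels

# Assumed heuristic: ARIA-style roles that are treated as interactable.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for one accessibility-tree or DOM node."""
    name: str
    role: str
    box: Box
    focusable: bool = False


def is_interactable(el: Element) -> bool:
    # Role/focusable heuristic; raw pixels alone cannot answer this.
    return el.focusable or el.role in INTERACTIVE_ROLES


def assign_marks(elements: List[Element]) -> Dict[int, Element]:
    """Label interactable elements 1..N in reading order for the current frame."""
    targets = sorted(
        (el for el in elements if is_interactable(el)),
        key=lambda el: (el.box[1], el.box[0]),  # top first, then left
    )
    return dict(enumerate(targets, start=1))


def resolve_click(action: str, marks: Dict[int, Element]) -> Tuple[int, int]:
    """Turn a symbolic action such as 'click button 12' into a click point."""
    match = re.search(r"\d+", action)
    if match is None:
        raise ValueError(f"no mark ID in action: {action!r}")
    mark_id = int(match.group())
    if mark_id not in marks:
        raise ValueError(f"mark {mark_id} is not on the current overlay")
    left, top, right, bottom = marks[mark_id].box
    return ((left + right) // 2, (top + bottom) // 2)  # centre of the box


def contains(box: Box, point: Tuple[int, int]) -> bool:
    left, top, right, bottom = box
    x, y = point
    return left <= x < right and top <= y < bottom


def layout(offset_y: int) -> List[Element]:
    """Toy page; offset_y simulates an ad pushing the content down."""
    return [
        Element("Logo", "img", (0, offset_y, 100, 40 + offset_y)),
        Element("Search", "textbox", (400, 300 + offset_y, 500, 340 + offset_y)),
        Element("Submit", "button", (400, 360 + offset_y, 500, 400 + offset_y)),
    ]


before = assign_marks(layout(0))
stale_point = resolve_click("click button 2", before)   # grounded on the old frame

after = assign_marks(layout(90))                        # re-ground after the shift
fresh_point = resolve_click("click button 2", after)

print(stale_point, contains(after[2].box, stale_point))  # (450, 380) False
print(fresh_point, contains(after[2].box, fresh_point))  # (450, 470) True
```

In this toy layout the coordinate computed before the shift misses the Submit button entirely, while the mark resolved against the current frame lands inside it. Mark numbers are only meaningful for the overlay they were generated from: if elements appear or disappear, the numbering changes, so the model has to read the fresh overlay on every turn rather than reuse an ID from an earlier one.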
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:54:44.325987+00:00— report_created — created