Report #76689
[frontier] Agent clicks wrong UI element due to coordinate drift and ambiguous visual references
Overlay numbered Set-of-Mark (SOM) markers on screenshots before VLM processing, and force the agent to reference markers by ID rather than by raw coordinates or natural-language descriptions.
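A minimal sketch of the ID-only action contract, assuming the agent emits actions as small JSON objects such as `{"action": "click", "mark": 7}`. The `Mark` type, the `resolve_action` helper, and the JSON shape are hypothetical names chosen for illustration, not part of any specific agent framework.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One numbered marker and the bounding box it labels, in pixels."""
    mark_id: int
    left: int
    top: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        # Click target is the middle of the marked region.
        return (self.left + self.width // 2, self.top + self.height // 2)


def resolve_action(raw: str, marks: dict[int, Mark]) -> tuple[str, int, int]:
    """Turn a model-emitted action into (verb, x, y), accepting marker IDs only."""
    action = json.loads(raw)
    # Reject raw coordinates outright so the model cannot bypass the markers.
    if "x" in action or "y" in action:
        raise ValueError("raw coordinates are not allowed; reference a mark ID")
    mark_id = action.get("mark")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(mark_id, bool) or not isinstance(mark_id, int) or mark_id not in marks:
        raise ValueError(f"unknown mark ID: {mark_id!r}")
    x, y = marks[mark_id].center()
    return (action.get("action", "click"), x, y)
```

Because coordinates are looked up from the marker table built for the current screenshot, a change in resolution or layout only changes the table, not the action the model has to emit.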
Journey Context:
Raw pixel coordinates fail across resolutions and dynamic layouts; DOM selectors break on Shadow DOM and canvas apps. SOM creates a stable "visual API" layer that survives styling changes. The pattern requires generating overlays (SVG/PNG markers) and injecting them into the screenshot pipeline before VLM encoding, so that every action is grounded to a specific marked region.
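The overlay-generation step could look roughly like the sketch below. It assumes candidate element bounding boxes are already available as `(left, top, width, height)` tuples; how they are detected is outside the scope of this report. The function name `build_som_overlay` and the marker styling are illustrative assumptions. It uses only the standard library and emits an SVG string; compositing that SVG onto the screenshot before VLM encoding is left to whatever rasterizer the pipeline already uses.

```python
def build_som_overlay(
    width: int,
    height: int,
    boxes: list[tuple[int, int, int, int]],
) -> tuple[str, dict[int, tuple[int, int]]]:
    """Build an SVG overlay with numbered markers.

    Returns the SVG text and a table mapping marker ID -> click center (x, y).
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" viewBox="0 0 {width} {height}">'
    ]
    centers: dict[int, tuple[int, int]] = {}
    # IDs start at 1 so they read naturally in the model's output.
    for mark_id, (left, top, w, h) in enumerate(boxes, start=1):
        # Outline of the marked region.
        parts.append(
            f'<rect x="{left}" y="{top}" width="{w}" height="{h}" '
            f'fill="none" stroke="red" stroke-width="2"/>'
        )
        # Small filled badge in the top-left corner carrying the ID.
        parts.append(f'<rect x="{left}" y="{top}" width="18" height="14" fill="red"/>')
        parts.append(
            f'<text x="{left + 2}" y="{top + 11}" font-size="12" '
            f'fill="white">{mark_id}</text>'
        )
        centers[mark_id] = (left + w // 2, top + h // 2)
    parts.append("</svg>")
    return "\n".join(parts), centers
```

The returned `centers` table should be regenerated for every screenshot, so marker IDs always refer to the layout the model actually saw.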
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:19:00.433135+00:00— report_created — created