Report #49630
[frontier] Agent fails to interact with React/Canvas/WebGL dashboards because DOM selectors don't match visual elements
Use Set-of-Mark (SoM) visual grounding: overlay numeric labels on UI elements and reference them by ID rather than by DOM selector
Journey Context:
Traditional web agents rely on CSS selectors or XPath, but many modern web apps (Figma, WebGL games, canvas-heavy dashboards) render to a Canvas element, so the DOM is little more than a container. Screenshot-based Set-of-Mark prompting overlays numeric labels on detected interactive elements, so a vision model can pick its target by label and the agent can map that label back to screen coordinates, bypassing the DOM entirely. This matters for computer use on web apps where the accessibility tree is flat or empty but the rendered UI is rich.
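A minimal sketch of the bookkeeping this approach needs, assuming some detector has already produced bounding boxes for the interactive regions of a screenshot. The names `Mark`, `assign_marks`, `draw_marks`, and `resolve_click` are hypothetical and the box values are made up for illustration. `draw_marks` assumes Pillow is installed; the rest uses only the standard library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    mark_id: int                      # numeric label shown to the vision model
    box: tuple[int, int, int, int]    # (left, top, right, bottom) in screenshot pixels

    @property
    def center(self) -> tuple[int, int]:
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def assign_marks(boxes):
    """Number boxes in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i: Mark(i, b) for i, b in enumerate(ordered, start=1)}


def draw_marks(image, marks):
    """Overlay each box and its numeric label on a Pillow image."""
    from PIL import ImageDraw  # imported lazily so the rest runs without Pillow

    draw = ImageDraw.Draw(image)
    for mark in marks.values():
        left, top, right, bottom = mark.box
        draw.rectangle([left, top, right, bottom], outline="red", width=2)
        draw.rectangle([left, top, left + 18, top + 12], fill="red")  # label background
        draw.text((left + 2, top), str(mark.mark_id), fill="white")
    return image


def resolve_click(marks, mark_id):
    """Translate the model's chosen label into a pixel coordinate to click."""
    if mark_id not in marks:
        raise KeyError(f"model referenced unknown mark {mark_id}")
    return marks[mark_id].center


# Hypothetical detector output for a canvas dashboard screenshot.
boxes = [(200, 40, 280, 70), (20, 40, 120, 70), (20, 100, 320, 300)]
marks = assign_marks(boxes)

# Suppose the vision model answers "click element 2".
click_point = resolve_click(marks, 2)
```

The agent sends the annotated screenshot to the model, parses the numeric ID out of the reply, and passes the resolved point to whatever input layer it uses for mouse events. Rejecting unknown IDs guards against the model naming a label that was never drawn.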
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:47:17.958047+00:00— report_created — created