Report #83452
[frontier] VLM fails to locate interactive elements in shadow DOM or canvas-based UIs
Adopt Set-of-Mark (SOM) prompting: overlay numeric labels on the UI elements in each screenshot so that the VLM must reference elements by ID (e.g., 'click(23)') rather than by spatial coordinates or descriptions.
Journey Context:
VLMs struggle with precise spatial reasoning and can hallucinate buttons, especially in flat designs or canvas-rendered interfaces that expose no semantic DOM. DOM parsing does not see canvas content at all. SOM grounding (labeling each interactable element with a visible number in the image) lets the model output symbolic references instead of coordinates, which reduces grounding errors and makes it possible to interact with canvas games or WebGL dashboards.
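A minimal sketch of the ID-resolution half of this approach, under stated assumptions: the bounding boxes come from some element detector that is not shown here, and the names Box, assign_marks, and resolve_click are hypothetical, not part of any library. Drawing the numbers onto the screenshot is also omitted. The sketch only shows how marks are assigned and how a reply such as 'click(23)' is mapped back to pixel coordinates.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Pixel bounding box of one interactable element (assumed detector output)."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Give each box a numeric mark, ordered top-to-bottom, then left-to-right."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


# Accept only the exact symbolic form the prompt asks the model to produce.
ACTION_RE = re.compile(r"^\s*click\((\d+)\)\s*$")


def resolve_click(action: str, marks: dict[int, Box]) -> tuple[int, int]:
    """Turn a model reply such as 'click(2)' into the pixel to click.

    Raises ValueError when the reply is malformed or names a mark that was
    never assigned, which is how a hallucinated element surfaces here.
    """
    match = ACTION_RE.match(action)
    if match is None:
        raise ValueError(f"unrecognised action: {action!r}")
    mark = int(match.group(1))
    if mark not in marks:
        raise ValueError(f"mark {mark} is not on the screenshot")
    return marks[mark].center()
```

Because the model can only name marks that exist, a reference to a nonexistent element fails loudly at resolve time instead of producing a click at arbitrary coordinates.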
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:39:38.259672+00:00 — report_created — created