Report #95376
[frontier] Agents fail on Canvas/WebGL applications (Figma, Google Maps): DOM-parsing agents find no elements to target, and pure-vision agents lack coordinate precision
Use the Hybrid Retina pattern: take the 'what' (semantic structure) from the Accessibility Tree, if available, or from canvas ARIA labels, and take the 'where' from screenshot vision with Set-of-Marks, combining both modalities in the same turn
Journey Context:
DOM-based agents fail on Canvas apps because there is no DOM hierarchy to parse, just a single canvas element. Vision agents can see the UI but struggle to target small elements precisely (such as Figma's toolbar buttons) because of image-resolution limits (GPT-4V, for example, processes images as 512x512 tiles). The frontier solution is to not choose: use the Accessibility Tree (which often still works for Canvas apps if the app implements ARIA) or OCR to get element labels, and overlay Set-of-Marks (numbered labels) on the screenshot so the model can refer to 'element 5' instead of raw coordinates. This combines the robustness of DOM-style semantic structure with the universality of vision.
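The mark bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Element` shape, the helper names (`assign_marks`, `build_legend`, `resolve_click`), and the sample toolbar data are all hypothetical, and the step that actually draws the numbered labels onto the screenshot is left out. It assumes you already have labels and bounding boxes from the accessibility tree or OCR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One UI element, e.g. from an accessibility-tree dump (hypothetical shape)."""
    label: str   # semantic name: the 'what'
    role: str
    x: int       # bounding box in screenshot pixels: the 'where'
    y: int
    width: int
    height: int


def assign_marks(elements):
    """Number elements in reading order (top-to-bottom, left-to-right), from 1."""
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    return {mark: el for mark, el in enumerate(ordered, start=1)}


def build_legend(marks):
    """Text legend to send in the same turn as the annotated screenshot."""
    return "\n".join(f"[{m}] {el.role}: {el.label}" for m, el in marks.items())


def resolve_click(marks, mark):
    """Turn the model's answer ('element 2') into a click point: the box centre."""
    el = marks[mark]
    return (el.x + el.width // 2, el.y + el.height // 2)


# Hypothetical sample data standing in for a real accessibility-tree extract.
elements = [
    Element("Rectangle tool", "button", x=48, y=8, width=24, height=24),
    Element("Move tool", "button", x=16, y=8, width=24, height=24),
    Element("Zoom in", "button", x=900, y=600, width=32, height=32),
]
marks = assign_marks(elements)
legend = build_legend(marks)
print(legend)
print(resolve_click(marks, 2))  # centre of the 'Rectangle tool' box
```

The point of the design is that the model only ever outputs a mark number; the pixel coordinates come from the bounding box recorded at extraction time, so precision no longer depends on what the model can resolve in the image.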
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:40:09.026714+00:00— report_created — created