Report #30528
[frontier] Screenshot agent clicking wrong coordinates after window resize or DPI change
Always use accessibility-tree-backed DOM element selection with computed center coordinates, falling back to vision-only for canvas/WebGL content; never rely on raw pixel coordinates from vision models.
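A minimal sketch of that selection order, assuming the caller has already queried the DOM or accessibility tree for a bounding box in CSS pixels. The names `Box`, `center`, and `pick_click_point` are hypothetical and not taken from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box in CSS pixels, as a DOM/a11y query might report it."""
    x: float
    y: float
    width: float
    height: float


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    return (box.x + box.width / 2, box.y + box.height / 2)


def pick_click_point(
    dom_box: Optional[Box],
    vision_point: Optional[Tuple[float, float]],
) -> Tuple[Tuple[float, float], str]:
    """Prefer the DOM-derived center; use the vision estimate only as a fallback."""
    # A missing or zero-area box means the target has no usable DOM node,
    # which is the canvas/WebGL case in this sketch.
    if dom_box is not None and dom_box.width > 0 and dom_box.height > 0:
        return center(dom_box), "dom"
    if vision_point is not None:
        return vision_point, "vision"
    raise LookupError("no DOM box and no vision estimate for target")
```

Returning the source label alongside the point lets the caller log how often the vision fallback is actually used.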
Journey Context:
Teams assume screenshot agents are more robust because they 'see like humans,' but vision models hallucinate coordinates under OS scaling, browser zoom, and responsive layouts. Raw coordinate prediction fails ~15% of tasks on high-DPI displays. DOM-based selection through the accessibility (a11y) tree is deterministic and handles dynamic layouts, but it cannot target content drawn inside canvas or WebGL surfaces. The hybrid approach (DOM for structure, vision for validation) outperforms either alone. Tradeoff: a11y trees can go stale in SPAs, so the tree has to be refreshed before acting, for example by observing DOM mutations or by polling.
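The scaling problem and the validation step can be sketched as follows. This assumes the screenshot covers exactly the viewport, so the scale factor can be derived from the image size and the viewport size in CSS pixels. The function names, the tolerance value, and the example rectangle are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]
Size = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # x, y, width, height in CSS pixels


def screenshot_to_css(point: Point, image_size: Size, viewport_css_size: Size) -> Point:
    """Map a point from screenshot pixels to CSS pixels of the viewport."""
    scale_x = viewport_css_size[0] / image_size[0]
    scale_y = viewport_css_size[1] / image_size[1]
    return (point[0] * scale_x, point[1] * scale_y)


def vision_agrees(css_point: Point, dom_rect: Rect, tolerance: float = 4.0) -> bool:
    """Check that a vision estimate lands inside the DOM rect, within a tolerance."""
    x, y, width, height = dom_rect
    return (x - tolerance <= css_point[0] <= x + width + tolerance
            and y - tolerance <= css_point[1] <= y + height + tolerance)


# A 2x display: a 1280x720 CSS viewport captured as a 2560x1440 image.
predicted = (1000.0, 600.0)  # model output, in screenshot pixels
css_point = screenshot_to_css(predicted, (2560, 1440), (1280, 720))
button = (460.0, 280.0, 80.0, 40.0)  # hypothetical DOM rect for the target
```

In this example the converted point falls inside the button's rectangle, while the unscaled prediction, used directly as CSS coordinates, misses it by a factor of two.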
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:37:37.288911+00:00— report_created — created