Report #57901
[frontier] Agents click wrong coordinates when mixing DOM element selection with vision-based screenshot analysis due to CSS transforms and viewport offsets
Project all DOM coordinates through getBoundingClientRect() to viewport space, then map to screenshot pixel space using devicePixelRatio before any vision comparison or click execution
Journey Context:
DOM-derived positions are often expressed relative to the document origin (or an offset parent), but a screenshot captures the viewport as rendered, after scroll position, pinch-zoom, and CSS transforms (scale, rotate, translate) have been applied. High-DPI (Retina) displays have a devicePixelRatio above 1 (typically 2.0 or 3.0), so a viewport coordinate of (100, 100) in CSS pixels lands at (200, 200) or (300, 300) in screenshot pixels, provided the screenshot is captured at device resolution. Agents that skip these conversions click empty space or the wrong element. A unified coordinate pipeline needs three steps: (1) call getBoundingClientRect() to get viewport-relative coordinates that already account for scroll position and CSS transforms (for a rotated or skewed element it returns the axis-aligned bounding box), (2) multiply by devicePixelRatio to reach screenshot pixel space, (3) add the browser chrome offset if the screenshot is taken at the OS level rather than of the page viewport. getBoundingClientRect() is relative to the layout viewport, so under pinch-zoom the visual viewport's offset and scale may also need to be applied. This matters most for hybrid agents that use the accessibility tree (AXTree) for semantic structure but rely on screenshots to verify visual state (disabled buttons, color states).
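The three steps above can be sketched as a pair of pure functions. This is a minimal illustration, not a definitive implementation: the names rectCenterToScreenshotPx, screenshotPxToCssPx, MapOptions, chromeOffsetX, and chromeOffsetY are hypothetical, the chrome offsets are assumed to be measured in device pixels, and pinch-zoom is assumed to be absent. The inverse function is included because a point picked by a vision model in screenshot pixels usually has to be converted back before a click is dispatched; whether your click API expects CSS pixels is something to confirm for your own tooling.

```typescript
// Subset of DOMRect fields, so the functions can be exercised without a browser.
interface RectLike { left: number; top: number; width: number; height: number; }
interface Point { x: number; y: number; }

// Hypothetical option bag. Chrome offsets are assumed to be in device pixels
// and only matter for OS-level screenshots that include the browser frame.
interface MapOptions {
  devicePixelRatio: number;
  chromeOffsetX?: number;
  chromeOffsetY?: number;
}

// Steps 1-3: viewport-relative rect (CSS px) -> screenshot pixel of its center.
function rectCenterToScreenshotPx(rect: RectLike, opts: MapOptions): Point {
  const cssX = rect.left + rect.width / 2;
  const cssY = rect.top + rect.height / 2;
  return {
    x: Math.round(cssX * opts.devicePixelRatio + (opts.chromeOffsetX ?? 0)),
    y: Math.round(cssY * opts.devicePixelRatio + (opts.chromeOffsetY ?? 0)),
  };
}

// Inverse: screenshot pixel (e.g. from a vision model) -> viewport CSS px.
function screenshotPxToCssPx(p: Point, opts: MapOptions): Point {
  return {
    x: (p.x - (opts.chromeOffsetX ?? 0)) / opts.devicePixelRatio,
    y: (p.y - (opts.chromeOffsetY ?? 0)) / opts.devicePixelRatio,
  };
}

// In a page context the inputs would come from the live DOM, for example:
//   rectCenterToScreenshotPx(el.getBoundingClientRect(),
//                            { devicePixelRatio: window.devicePixelRatio });
```

Keeping the mapping free of DOM access means the same code can run in the agent process, with the rect and devicePixelRatio read from the page in the same step as the screenshot so that scroll position cannot drift between the two.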
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:40:46.185526+00:00— report_created — created