Report #85908
[frontier] Agent clicks wrong coordinates when moving between Retina and non-Retina displays or headless browsers
Normalize all coordinate systems to CSS logical pixels before VLM inference. Query window.devicePixelRatio and divide the VLM's coordinates by it (inverse scaling) before executing clicks.
Journey Context:
VLMs predict bounding boxes in screenshot pixels. On a Retina display (devicePixelRatio=2), a 1920x1080 CSS viewport produces a 3840x2160 screenshot. If the VLM predicts a click at (100, 100) in screenshot coordinates, the target actually sits at CSS (50, 50). An automation library that expects CSS coordinates treats (100, 100) as CSS pixels, so the click lands at physical pixel (200, 200), twice as far from the origin as intended, and misses the target. Conversely, headless Chrome often captures at CSS resolution while the VLM expects a high-resolution image. This 'coordinate drift' compounds over multi-step tasks: the agent clicks increasingly wrong locations as it chains actions. The production pattern is strict coordinate hygiene: either capture screenshots at CSS logical resolution (deviceScaleFactor=1.0) before sending them to the VLM, or explicitly annotate which coordinate system the screenshot uses. When a screenshot was captured at physical resolution, scale the VLM's coordinates by 1/devicePixelRatio before executing the click.
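The inverse scaling described above can be sketched in a few lines. This is a minimal illustration, not a verified fix: `Point`, `infer_scale`, and `screenshot_to_css` are hypothetical names, and deriving the scale from the screenshot width versus the CSS viewport width is an assumption used here as a cross-check on window.devicePixelRatio. It also assumes the screenshot covers exactly the viewport and was not resized before reaching the VLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A 2D coordinate; the unit (screenshot or CSS pixels) depends on context."""
    x: float
    y: float


def infer_scale(screenshot_width: int, viewport_css_width: int) -> float:
    """Return screenshot pixels per CSS pixel (hypothetical helper).

    Assumes the screenshot covers exactly the viewport, so the ratio
    should match window.devicePixelRatio (2.0 on a typical Retina capture).
    """
    if screenshot_width <= 0 or viewport_css_width <= 0:
        raise ValueError("widths must be positive")
    return screenshot_width / viewport_css_width


def screenshot_to_css(point: Point, scale: float) -> Point:
    """Map a VLM prediction in screenshot pixels to CSS pixels."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return Point(point.x / scale, point.y / scale)


# Retina example from the report: 1920x1080 CSS viewport, 3840x2160 screenshot.
scale = infer_scale(screenshot_width=3840, viewport_css_width=1920)
click = screenshot_to_css(Point(100, 100), scale)  # the CSS point to click
```

With a scale of 1.0 (a capture at CSS resolution) the conversion leaves coordinates unchanged, so the same code path can serve both Retina and non-Retina sessions. The resulting point would then be handed to whatever click call the automation library provides, on the assumption that it expects CSS pixels.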
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:47:08.285366+00:00— report_created — created