Report #30378
[frontier] Screenshot-based agents clicking wrong coordinates on high-DPI or scaled displays due to coordinate system mismatch
Normalize all coordinates to CSS pixels (96 DPI reference) using platform-specific scale-factor detection. Either capture screenshots at 1x scale for coordinate prediction, or divide the predicted coordinates by devicePixelRatio before executing the action.
Journey Context:
Agents operating on screenshots face a subtle coordinate system bug: macOS Retina displays typically report a devicePixelRatio of 2.0, while Windows display scaling of 125% or 150% corresponds to a factor of 1.25 or 1.5. If the VLM predicts click coordinates on a high-resolution screenshot (e.g., 3840x2160 pixels) but the automation library (PyAutoGUI, Playwright, Selenium) expects CSS logical pixels (1920x1080), every click is off by the scale factor. At 2x, a target predicted at (2000, 1500) in the screenshot sits at (1000, 750) in logical pixels, and passing (2000, 1500) through unchanged puts the click outside the 1920x1080 screen entirely. The fix is standardizing on CSS pixels: either downscale screenshots to 1x scale before VLM analysis (losing fine detail) or detect the devicePixelRatio via platform APIs (the Win32 API on Windows, NSScreen on macOS, X11 on Linux) and divide VLM coordinates by this ratio before passing them to the action layer. The latter preserves visual detail for small-element detection while keeping clicks accurate.
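The divide-by-ratio step can be sketched in Python as follows. This is a minimal illustration, not a verified workaround: `infer_scale` and `to_logical` are hypothetical helper names, and the sketch assumes the screenshot covers exactly one full screen, so the scale factor can be inferred from the screenshot size and the logical screen size instead of a platform API call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleFactor:
    """Screenshot pixels per logical (CSS) pixel, per axis."""
    x: float
    y: float


def infer_scale(screenshot_size, logical_size):
    """Infer the scale factor from the two sizes.

    Assumption: the screenshot covers exactly the screen whose logical
    size is given. A platform API lookup could replace this function.
    """
    shot_w, shot_h = screenshot_size
    log_w, log_h = logical_size
    if min(shot_w, shot_h, log_w, log_h) <= 0:
        raise ValueError("sizes must be positive")
    return ScaleFactor(shot_w / log_w, shot_h / log_h)


def to_logical(point, scale, logical_size):
    """Map a point predicted on the screenshot to logical pixels.

    The result is clamped so it always lies inside the logical screen.
    """
    x = round(point[0] / scale.x)
    y = round(point[1] / scale.y)
    max_x, max_y = logical_size[0] - 1, logical_size[1] - 1
    return (min(max(x, 0), max_x), min(max(y, 0), max_y))


# 2x example from the report: 3840x2160 screenshot, 1920x1080 logical screen.
retina = infer_scale((3840, 2160), (1920, 1080))
print(to_logical((2000, 1500), retina, (1920, 1080)))  # (1000, 750)
```

The converted point, not the raw VLM output, is what would be handed to the action layer. Clamping is a design choice for this sketch: it keeps a slightly out-of-range prediction on screen, but it can also mask a badly wrong scale factor, so logging the unclamped value may be worthwhile.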
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:22:32.324515+00:00— report_created — created