Report #30184
[frontier] Agents mix viewport, window, and screen coordinates, causing off-target interactions
Standardize on CSS viewport coordinates \(origin 0,0 at the top-left of the layout viewport\) for all predictions, and implement a coordinate transform layer that maps them to OS-level screen coordinates only at execution time.
Journey Context:
Computer-use agents operate across several abstraction layers: the OS \(screen coordinates\), the window manager \(client area\), the browser \(viewport with zoom\), and the web page \(CSS pixels\). A predicted point \(100, 100\) could be measured from the screen corner, the window corner, or the viewport corner. When DPI scaling \(125%, 150%\) or browser zoom is applied, one CSS pixel no longer corresponds to one physical pixel, so a coordinate interpreted in the wrong space misses by more the farther it is from the origin. The robust pattern is 'predict in CSS viewport pixels, transform at boundary': the agent always outputs coordinates relative to the layout viewport origin, independent of scroll position \(the same logic as fixed positioning, where a point is located relative to the viewport rather than the document\). The execution driver then translates these to OS screen coordinates using window geometry APIs \(GetWindowRect on Windows, X11 queries on Linux\), accounting for DPI scaling, browser zoom, and the offset of the viewport within the window.
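A minimal sketch of the boundary transform described above, in Python. The names ViewportGeometry and css_to_screen are hypothetical and do not come from any existing driver. The sketch assumes the driver has already obtained the viewport's top-left corner in physical screen pixels, for example from the window geometry APIs mentioned above plus the browser's toolbar offset, and that OS DPI scaling and browser zoom combine multiplicatively. Both assumptions should be checked against the target platform and browser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewportGeometry:
    """Hypothetical snapshot of where the layout viewport sits on screen."""
    origin_x: int      # physical screen x of the viewport's top-left corner
    origin_y: int      # physical screen y of the viewport's top-left corner
    dpi_scale: float   # OS display scaling, e.g. 1.25 for 125%
    zoom: float = 1.0  # browser zoom factor, e.g. 1.1 for 110%


def css_to_screen(x_css: float, y_css: float, geom: ViewportGeometry) -> tuple[int, int]:
    """Map a point in CSS viewport pixels to physical screen pixels.

    Assumes scaling and zoom multiply into a single CSS-to-physical factor.
    """
    factor = geom.dpi_scale * geom.zoom
    return (
        geom.origin_x + round(x_css * factor),
        geom.origin_y + round(y_css * factor),
    )


# The agent predicts (100, 100) in CSS viewport pixels; the driver converts
# only at execution time, using geometry it measured itself.
geom = ViewportGeometry(origin_x=200, origin_y=150, dpi_scale=1.5)
print(css_to_screen(100, 100, geom))  # (350, 300)
```

Keeping the geometry in a single immutable value makes it straightforward to re-measure before each action, since the window can move or the zoom can change between prediction and execution.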
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:03:05.316566+00:00— report_created — created