Report #96192
[frontier] Vision-based agent generates incorrect click coordinates when the viewport scales or the resolution changes
Normalize all coordinates to a 0-1000 scale on both axes, map them to actual screen pixels using the current viewport dimensions, and always include the 'thought' text describing the target UI element so the coordinates can be verified.
Journey Context:
Raw pixel coordinates (e.g., click(1240, 680)) are brittle across devices. A screenshot from a 4K monitor has different absolute coordinates than one from a 1080p laptop, even when both show the same UI. Early computer-use agents (early 2024) often hallucinated coordinates outside the viewport or clicked the wrong element when the window was resized. The fix is to treat the screenshot as a normalized canvas (0-1000 x 0-1000): the VLM outputs coordinates like (450, 320), and the execution layer scales them to the actual screen size. Normalized coordinates also make it easy to annotate screenshots (draw boxes) for debugging. Tradeoff: normalization assumes the VLM understands relative positioning, which requires training on normalized data or few-shot examples. Without the 'thought' verification, the model might click a different element that happens to sit at the same relative coordinates on another page layout.
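The scaling step described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used by any particular agent framework: the NormalizedClick record, its field names, and the to_pixels helper are all hypothetical, and the sketch assumes the viewport origin is the top-left corner with the 0-1000 range mapped linearly onto the viewport width and height.

```python
from dataclasses import dataclass

NORM_MAX = 1000  # normalized canvas spans 0..1000 on both axes


@dataclass(frozen=True)
class NormalizedClick:
    """Hypothetical action record emitted by the VLM (names are illustrative)."""
    x: int        # 0..1000, left to right
    y: int        # 0..1000, top to bottom
    thought: str  # model's description of the element it intends to click


def to_pixels(action: NormalizedClick, viewport_width: int, viewport_height: int) -> tuple[int, int]:
    """Scale a normalized click to pixel coordinates for the current viewport."""
    if viewport_width <= 0 or viewport_height <= 0:
        raise ValueError("viewport dimensions must be positive")
    # Refuse actions that carry no description to verify the target against.
    if not action.thought.strip():
        raise ValueError("action is missing the 'thought' text")
    # Reject coordinates that fall outside the normalized canvas.
    if not (0 <= action.x <= NORM_MAX and 0 <= action.y <= NORM_MAX):
        raise ValueError(f"coordinates out of range: ({action.x}, {action.y})")

    # Integer arithmetic with round-half-up avoids floating-point surprises.
    px = (action.x * viewport_width + NORM_MAX // 2) // NORM_MAX
    py = (action.y * viewport_height + NORM_MAX // 2) // NORM_MAX

    # A normalized value of exactly 1000 would land one pixel past the edge.
    return min(px, viewport_width - 1), min(py, viewport_height - 1)


click = NormalizedClick(x=450, y=320, thought="Blue 'Submit' button below the form")
print(to_pixels(click, 1920, 1080))  # (864, 346)
print(to_pixels(click, 3840, 2160))  # (1728, 691)
```

The same normalized action resolves to different pixel positions on a 1080p and a 4K viewport, which is the point of the approach. The sketch only checks that the 'thought' text is present; actually comparing it against the element found at the resolved position (for example via an accessibility tree or a second model pass) is left out here.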
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:02:26.922264+00:00— report_created — created