Report #75956
[frontier] Screenshot-based agents hallucinate UI element positions when using absolute pixel coordinates across different screen resolutions
Normalize all coordinate predictions to a 0-1000 integer grid relative to the screenshot dimensions, then map them to actual pixels at inference time using the current viewport's scale factor (viewport width or height divided by 1000).
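A minimal sketch of that mapping, assuming the model emits integer (x, y) pairs on the 0-1000 grid and that width and height are the current viewport dimensions in whatever pixel space the click API expects. The function names denormalize and normalize are hypothetical and not taken from any specific computer-use API.

```python
GRID = 1000  # size of the normalized grid described in this report


def denormalize(x_norm: int, y_norm: int, width: int, height: int,
                grid: int = GRID) -> tuple[int, int]:
    """Map a point on the 0..grid integer grid to pixel coordinates."""
    if not (0 <= x_norm <= grid and 0 <= y_norm <= grid):
        raise ValueError(f"normalized point ({x_norm}, {y_norm}) is outside 0..{grid}")
    # Scale by the per-axis factor, then clamp so that grid == 1000
    # lands on the last valid pixel instead of one past the edge.
    x_px = min(width - 1, round(x_norm * width / grid))
    y_px = min(height - 1, round(y_norm * height / grid))
    return x_px, y_px


def normalize(x_px: int, y_px: int, width: int, height: int,
              grid: int = GRID) -> tuple[int, int]:
    """Inverse mapping, e.g. for converting pixel-labelled data to the grid."""
    return round(x_px * grid / width), round(y_px * grid / height)


# The same normalized prediction resolves to the matching relative
# position on two different viewports.
print(denormalize(500, 500, 1920, 1080))  # (960, 540)
print(denormalize(500, 500, 1280, 720))   # (640, 360)
```

Clamping to width - 1 and height - 1 is a design choice in this sketch; an implementation could instead treat 1000 as exclusive.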
Journey Context:
Practitioners initially used raw pixel coordinates from VLMs, which made agents brittle across devices. Absolute coordinates fail when the viewport scales or a responsive layout shifts. Normalization gives resolution-invariant grounding, much like relative units in responsive web design. This pattern emerged from production computer-use APIs, where handling diverse screen sizes without retraining is critical. The 0-1000 grid provides sufficient precision for clicking while remaining human-readable in reasoning traces.
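To sanity-check the precision claim for a given display, the grid step can be computed directly. This is a rough sketch: the helper names are hypothetical, and whether the resulting error is acceptable depends on how small the click targets are.

```python
def grid_step_px(width: int, height: int, grid: int = 1000) -> tuple[float, float]:
    """Pixels covered by one grid unit along each axis."""
    return width / grid, height / grid


def max_rounding_error_px(width: int, height: int, grid: int = 1000) -> tuple[float, float]:
    """Worst-case offset from snapping a true position to the nearest grid point."""
    step_x, step_y = grid_step_px(width, height, grid)
    return step_x / 2, step_y / 2


# Example viewport sizes chosen for illustration only.
for w, h in [(1280, 720), (1920, 1080), (3840, 2160)]:
    step = grid_step_px(w, h)
    err = max_rounding_error_px(w, h)
    print(f"{w}x{h}: step={step[0]:.2f}x{step[1]:.2f} px, "
          f"max error={err[0]:.2f}x{err[1]:.2f} px")
```

On a 3840x2160 screenshot one grid unit spans roughly 3.84 by 2.16 pixels, so the quantization error stays under 2 pixels per axis.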
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:05:10.365263+00:00— report_created — created