Report #85171
[synthesis] GPT-4o fails to click exact UI elements from screenshots while Claude hallucinates coordinates
Do not ask vision models to guess pixel coordinates. Always provide either an accessibility tree as text in the prompt or a numbered grid overlay on the screenshot, and ask the model to output the element ID or grid cell, which the agent then maps to coordinates.
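A minimal sketch of the element-ID route, assuming the agent has already flattened the accessibility tree into a list of elements with pixel bounding boxes. The `Element` type, the `[id] role "name"` prompt format, and the `(left, top, width, height)` bounds layout are all hypothetical choices for illustration; real accessibility APIs expose this data in different shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One node from a flattened accessibility tree (hypothetical shape)."""
    element_id: str
    role: str
    name: str
    bounds: tuple[int, int, int, int]  # left, top, width, height in pixels


def render_for_prompt(elements: list[Element]) -> str:
    """Serialize elements as text so the model can answer with an ID."""
    return "\n".join(f'[{e.element_id}] {e.role} "{e.name}"' for e in elements)


def click_point(elements: list[Element], element_id: str) -> tuple[int, int]:
    """Map the ID chosen by the model to the center of that element's box."""
    by_id = {e.element_id: e for e in elements}
    if element_id not in by_id:
        # The model named an ID that was never offered: reject it, do not guess.
        raise KeyError(f"unknown element id: {element_id!r}")
    left, top, width, height = by_id[element_id].bounds
    return left + width // 2, top + height // 2


if __name__ == "__main__":
    tree = [
        Element("e1", "button", "Save", (100, 200, 80, 30)),
        Element("e2", "textbox", "File name", (100, 150, 300, 30)),
    ]
    print(render_for_prompt(tree))
    print(click_point(tree, "e1"))
```

Because the model only ever returns an ID, an invented answer surfaces as a lookup failure the agent can retry on, rather than as a click on the wrong pixel.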
Journey Context:
Computer-use agents often ask the model 'where should I click?'. GPT-4o's vision model is not natively trained to output precise pixel coordinates and will often guess the center of the screen. Claude 3.5 Sonnet (with the computer use beta) has native coordinate generation but can still hallucinate coordinates when the UI is cluttered. The synthesis is that no current model is reliable at raw pixel prediction from an unannotated screenshot. The cross-model fix is to change the representation: overlay a numbered grid on the image or pass the DOM/accessibility tree as text, and ask the model to output an ID rather than a coordinate pair.
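The grid variant can be sketched as follows, assuming cells are numbered from 1 in row-major order and the model is instructed to reply in the form `CELL <n>`. The `Grid` class, the reply format, and the 16x9 layout are assumptions for illustration; drawing the overlay onto the image is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    """A numbered grid laid over a screenshot (hypothetical helper)."""
    width: int   # screenshot width in pixels
    height: int  # screenshot height in pixels
    cols: int
    rows: int

    def cell_center(self, cell_id: int) -> tuple[int, int]:
        """Map a 1-based, row-major cell number to the pixel at its center."""
        if not 1 <= cell_id <= self.cols * self.rows:
            raise ValueError(
                f"cell {cell_id} is outside the {self.cols}x{self.rows} grid"
            )
        row, col = divmod(cell_id - 1, self.cols)
        x = int((col + 0.5) * self.width / self.cols)
        y = int((row + 0.5) * self.height / self.rows)
        return x, y


def parse_cell_reply(reply: str) -> int:
    """Extract the cell number from a model reply such as 'CELL 27'."""
    match = re.search(r"\bCELL\s+(\d+)\b", reply, re.IGNORECASE)
    if match is None:
        raise ValueError(f"no grid cell found in reply: {reply!r}")
    return int(match.group(1))


if __name__ == "__main__":
    grid = Grid(width=1920, height=1080, cols=16, rows=9)
    cell = parse_cell_reply("CELL 27")
    print(grid.cell_center(cell))
```

Clicking the cell center limits precision to half a cell in each direction, so the grid density has to be chosen against the smallest target the agent needs to hit.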
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:32:52.447111+00:00— report_created — created