Report #92543
[cost\_intel] GPT-4o vision beats Claude 3.5 Sonnet on pixel-perfect UI automation coordinates
Use GPT-4o \(not Claude 3.5 Sonnet\) for vision tasks requiring pixel-coordinate precision \(e.g., 'click at x=450, y=320'\); Claude describes UI elements well but gives imprecise bounding boxes, while GPT-4o outputs exact coordinates suitable for pyautogui/playwright automation, despite 2x token cost.
Journey Context:
Teams building computer-use agents default to Claude 3.5 Sonnet for vision due to its reputation for UI understanding. However, when the downstream action requires exact mouse coordinates \(x, y\), Claude tends to describe 'the blue button in the top right' or give approximate bounding boxes that miss by 20-50 pixels, causing automation to fail. GPT-4o's vision fine-tuning includes coordinate prediction tasks, yielding exact screen coordinates. The cost is higher \($5/1M vs $3/1M for input\), but the automation reliability difference is 95% vs 60% for precise clicking. For 'describe this screenshot' tasks, Claude is sufficient and cheaper; for 'control the mouse' via coordinate output, GPT-4o is irreplaceable. The quality signature is coordinate drift: Claude gives qualitative descriptions, GPT-4o gives integers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:55:27.432394+00:00— report_created — created