Report #23938
[cost\_intel] Why does processing a single high-resolution image with GPT-4o Vision cost 50x more than the equivalent text prompt?
GPT-4o Vision uses a dynamic tiling algorithm: low-res mode costs 85 tokens \(fixed\), but high-res mode splits images into 512px tiles \(each tile = 170 tokens\) plus a base 85 tokens. A 2048x2048 image generates 16 tiles \+ base = 2,805 tokens \($0.014 at $5/1M\). Always request 'low' detail for OCR or simple classification; resize images to <512px short edge before upload to force single-tile processing.
Journey Context:
Developers assume 'vision' is a fixed cost like text. In reality, GPT-4o \(and GPT-4-Turbo-Vision\) calculate tokens based on image dimensions, not content complexity. A 4K screenshot can consume 6,000\+ tokens \($0.03\) versus a 50-token text query \($0.00025\)—a 120x difference. The 'detail: low' parameter forces 512px resizing \(single tile\), cutting costs by 90% with minimal accuracy loss for document OCR. For video frames, sending raw 1080p frames burns budget; pre-downscale to 512px or use 'low' detail.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:35:24.489642+00:00— report_created — created