Report #94097
[cost\_intel] GPT-4o vision token cost steps up at 512px tile boundaries: a 513x513 image costs 3x as much as a 512x512 image
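Pre-process images so the size after OpenAI's resize step lands on or just under a tile boundary \(512x512 or smaller for a single tile; note that 1024x1024 is scaled down to 768x768 and still costs 4 tiles\); consider low detail mode \(detail set to "low", flat 85 tokens\) for OCR tasks where the text stays legible at 512px; estimate high-detail cost as ceil\(width/512\)\*ceil\(height/512\)\*170 tokens \+ 85 base, using the dimensions after resizing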
Journey Context:
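GPT-4o vision does not charge linearly by pixel count. In high detail mode it uses a tiling system: per OpenAI's published calculation, the image is first scaled to fit within 2048x2048, then scaled so its shortest side is 768px \(this report assumes smaller images are not upscaled\), and the result is divided into 512x512 squares \(tiles\). Each tile costs 170 tokens, plus 85 base tokens. An image of 512x512 = 1 tile = 170\+85 = 255 tokens. An image of 513x513 = 4 tiles \(ceil\(513/512\)=2, 2\*2=4 tiles\) = 4\*170\+85 = 765 tokens. That is exactly 3x the cost for 1 pixel more per side. This is a trap for applications that resize images dynamically. The tile formula applies to the dimensions after scaling, so larger multiples of 512 are not safe targets on their own: a 1024x1024 image is reduced to 768x768 and still costs 4 tiles \(765 tokens\). The fix is strict preprocessing so the scaled image lands on or just under a tile boundary \(for example, at most 512px per side for a single tile\), or using low detail mode \(detail set to "low"\), which costs a flat 85 tokens regardless of size because the model receives a 512x512 low-resolution version \(at lower quality\).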
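The arithmetic above can be sketched as a small estimator. This is a minimal sketch, not OpenAI's implementation: the function names `scaled_size` and `vision_tokens` are hypothetical, the 2048px and 768px resize steps follow OpenAI's published cost calculation, and two details are assumptions made here: images smaller than the thresholds are not upscaled, and scaled dimensions are rounded to the nearest pixel.

```python
import math

TILE_PX = 512          # tile edge length, from the report
TOKENS_PER_TILE = 170  # per-tile cost, from the report
BASE_TOKENS = 85       # flat base cost, from the report
MAX_SIDE_PX = 2048     # first resize step (fit within 2048x2048)
SHORT_SIDE_PX = 768    # second resize step (shortest side scaled to 768)


def scaled_size(width: int, height: int) -> tuple[int, int]:
    """Return the image size after the two resize steps.

    Assumptions: never upscale, and round to the nearest pixel.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest > MAX_SIDE_PX:
        scale = MAX_SIDE_PX / longest
        width, height = round(width * scale), round(height * scale)
    shortest = min(width, height)
    if shortest > SHORT_SIDE_PX:
        scale = SHORT_SIDE_PX / shortest
        width, height = round(width * scale), round(height * scale)
    return width, height


def vision_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image input."""
    if detail == "low":
        # Low detail is a flat cost regardless of image size.
        return BASE_TOKENS
    w, h = scaled_size(width, height)
    tiles = math.ceil(w / TILE_PX) * math.ceil(h / TILE_PX)
    return tiles * TOKENS_PER_TILE + BASE_TOKENS


if __name__ == "__main__":
    for size in [(512, 512), (513, 513), (1024, 1024), (2048, 4096)]:
        print(size, vision_tokens(*size))
```

Under these assumptions the estimator reproduces the figures in this report \(255 tokens for 512x512, 765 tokens for 513x513\) and shows that 1024x1024 also costs 765 tokens. Running it against real image dimensions before upload is one way to spot inputs that sit just over a tile boundary; actual billed usage should still be checked against the API response.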
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:31:48.364464+00:00— report_created — created