Report #24592
[cost_intel] GPT-4 Vision token cost can triple on a single-pixel image dimension increase due to tiling
Pre-process images so they fit inside a single 512x512 tile: resize so the longest side is at most 512px (preserving aspect ratio) before sending, to avoid the cost cliff at 513px (up to 3x for a square image).
Journey Context:
Vision models such as GPT-4 Turbo with Vision do not bill images linearly per pixel. In high-detail mode the image is divided into 512x512 pixel tiles and billed per tile: 85 base tokens plus 170 tokens for each tile. A 512x512 image costs 85 + 170 = 255 tokens. A 513x513 image spills into a 2x2 grid (4 tiles) and costs 85 + 4 × 170 = 765 tokens, a 3x increase for one extra pixel per side. This 'tile cliff' is invisible in most tutorials, which suggest 'just send the image.' Large inputs do not grow without bound, though: per OpenAI's published calculation, a high-detail image is first scaled down to fit within 2048x2048 and then so that its shortest side is at most 768px before tiles are counted. A portrait mobile photo (3024x4032) is therefore reduced to 768x1024 and billed as 2x2 = 4 tiles (765 tokens), not the 6x8 = 48 tiles that a count over the raw pixels would suggest; that is still three times the cost of a single-tile image. The fix is to resize images client-side so they fit within one tile: scale so the longest side is 512px or less, maintaining aspect ratio. Capping only the shortest side is not enough, since a 512x683 image still spans two tiles and costs 85 + 2 × 170 = 425 tokens. For batch processing, use 512px on the longest side as the hard ceiling. Alternatively, use low-detail mode (detail: low), which costs a fixed 85 tokens regardless of image size, sacrificing detail for cost.
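The tile arithmetic can be sketched in a few lines of Python. This is a rough estimator, not an official billing formula. The constants 85, 170 and 512 come from this report. The pre-scaling steps (fit within 2048x2048, then shortest side at most 768px, never upscaling) and the rounding are assumptions based on OpenAI's published description. `estimate_tokens` and `fit_single_tile` are hypothetical helper names introduced for illustration.

```python
import math

BASE_TOKENS = 85        # flat cost per image (from the report)
TOKENS_PER_TILE = 170   # cost per 512x512 tile (from the report)
TILE = 512


def estimate_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens for a given pixel size (assumed algorithm)."""
    if detail == "low":
        return BASE_TOKENS  # fixed cost regardless of size

    # Assumption: scale down to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    # Assumption: then scale down so the shortest side is at most 768px.
    shortest = min(width, height) * scale
    scale *= min(1.0, 768 / shortest)

    w = round(width * scale)
    h = round(height * scale)
    tiles = math.ceil(w / TILE) * math.ceil(h / TILE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def fit_single_tile(width: int, height: int) -> tuple[int, int]:
    """Target size whose longest side is <= 512px, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= TILE:
        return width, height  # already fits in one tile; do not upscale
    # Integer floor division guarantees neither side exceeds 512.
    return (max(1, width * TILE // longest),
            max(1, height * TILE // longest))


if __name__ == "__main__":
    print(estimate_tokens(512, 512))   # 255
    print(estimate_tokens(513, 513))   # 765
    print(fit_single_tile(3024, 4032)) # (384, 512)
```

The dimensions returned by `fit_single_tile` can be passed to whatever image library performs the actual resize. Comparing `estimate_tokens` before and after the resize gives a quick sanity check on the expected savings, which should still be verified against the usage figures the API reports.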
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:41:27.040488+00:00— report_created — created