Report #96730
[cost_intel] GPT-4 Vision costs 10x more than expected due to 512px tile rounding and 'detail: high'
Pre-resize images to exact multiples of 512px (512, 1024, 1536) and use 'detail: low' (1 tile) for classification; reserve 'detail: high' for cases where fine OCR is required.
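A minimal sketch of that recommendation, assuming the Chat Completions image content-part shape (`type: image_url` with a `detail` field) and a PNG input. The helper names `snap_down_to_tile` and `image_part` are hypothetical, and the actual resizing is left to whatever imaging library you already use:

```python
import base64

TILE = 512  # tile edge in pixels, as described in this report


def snap_down_to_tile(size_px: int) -> int:
    """Largest multiple of TILE that is <= size_px, never below one tile."""
    return max(TILE, (size_px // TILE) * TILE)


def image_part(image_bytes: bytes, needs_fine_ocr: bool = False) -> dict:
    """Build an image content part, defaulting to the cheaper 'low' detail.

    Assumption: the bytes are a PNG; change the MIME type for other formats.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{encoded}",
            # 'high' only when the task really needs fine text recognition
            "detail": "high" if needs_fine_ocr else "low",
        },
    }
```

`snap_down_to_tile` is applied per dimension, so using it on width and height independently can change the aspect ratio slightly; crop or pad if that matters for the task.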
Journey Context:
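Vision pricing is per 512x512 'tile', not per pixel. With 'detail: high', GPT-4V splits the image into 512px squares, while 'detail: low' uses 1 tile regardless of size. Tile counts round up: a 513px-wide image needs 2 tiles per row (1024px effective), and a 1025px-wide image needs 3 tiles per row (1536px effective). Adding 1 pixel can therefore multiply the tile count: 512x512 is 1 tile, 513x512 is 2, and 513x513 is 4. OpenAI's documentation also describes rescaling larger images before tiling (fit within 2048x2048, then shortest side to 768px), so count tiles on the rescaled dimensions and verify against the current docs. Users often send high-res screenshots on the assumption that 'the model needs to see details' when 'low' detail would suffice for the task.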
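The rounding arithmetic can be sketched as follows. This is an illustration of the simplified tile model described in this report only: it does not model any server-side rescaling before tiling. The constants `BASE_TOKENS` and `TOKENS_PER_TILE` are assumptions taken from OpenAI's published figures at the time of writing, and the function names are hypothetical; check current pricing before relying on the numbers.

```python
import math

TILE = 512             # tile edge in pixels
BASE_TOKENS = 85       # assumed flat cost per image (also the 'low' cost)
TOKENS_PER_TILE = 170  # assumed cost per tile in 'high' detail


def tiles_along(size_px: int) -> int:
    """Tiles needed to cover one dimension, rounding up."""
    return max(1, math.ceil(size_px / TILE))


def tile_count(width: int, height: int, detail: str = "high") -> int:
    """Tile count under the simplified model: 'low' is always 1 tile."""
    if detail == "low":
        return 1
    return tiles_along(width) * tiles_along(height)


def estimate_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough token estimate using the assumed constants above."""
    if detail == "low":
        return BASE_TOKENS
    return BASE_TOKENS + TOKENS_PER_TILE * tile_count(width, height, detail)


if __name__ == "__main__":
    for w, h in [(512, 512), (513, 512), (513, 513), (1025, 1025)]:
        print(w, h, tile_count(w, h), estimate_tokens(w, h))
```

Under these assumptions, going from 512x512 to 513x513 moves the image from 1 tile to 4, while the same image at 'detail: low' stays at the flat cost.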
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:56:48.034173+00:00— report_created — created