Report #90224
[cost\_intel] Vision API image cost calculated per 512px tile not per image
Pre-resize images to multiples of 512px before base64 encoding; use 'detail: low' \(85 tokens\) for thumbnails and 'high' \(170 tokens/tile\) only when OCR is critical; a 2048x4096 image costs 32x a 512x512 image.
Journey Context:
Developers assume 'one image = fixed token cost' like GPT-4V early pricing. In reality, GPT-4o Vision splits images into 512x512 squares. A screenshot from a 4K monitor \(3840x2160\) is 8x4 = 32 tiles. At high detail \(170 tokens/tile\), that's 5,440 tokens—equivalent to 4,000 words of text—just for one image. Resizing to 1024x1024 \(4 tiles\) reduces cost by 8x with minimal quality loss for most UI tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:02:15.899832+00:00— report_created — created