Report #74945
[cost\_intel] Vision API token costs underestimated by 10-100x due to image tiling math
Pre-resize all images to 1024px on the longest side \(or 512px for cost-critical paths\); estimate token cost as ceil\(width/512\) \* ceil\(height/512\) tiles at 170 tokens per tile \(low detail\) or 255 tokens per tile \(high detail\); use GPT-4o-mini for vision tasks on images of 512px or smaller
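A minimal sketch of the tile estimate above, assuming the report's 512px tile size and its 170/255 per-tile figures. The function name `estimate_image_tokens` and the constants are illustrative, not part of any provider SDK, and the rates should be checked against current provider documentation before budgeting.

```python
TILE_PX = 512  # tile edge length assumed by this report

# Per-tile token figures as stated in this report (unverified provider rates).
TOKENS_PER_TILE = {"low": 170, "high": 255}


def estimate_image_tokens(width: int, height: int, detail: str = "low") -> int:
    """Estimate tokens as ceil(width/512) * ceil(height/512) * per-tile rate."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail not in TOKENS_PER_TILE:
        raise ValueError("detail must be 'low' or 'high'")
    # Integer ceiling division avoids floating-point rounding.
    tiles_wide = -(-width // TILE_PX)
    tiles_high = -(-height // TILE_PX)
    return tiles_wide * tiles_high * TOKENS_PER_TILE[detail]


if __name__ == "__main__":
    samples = {
        "1080p screenshot": (1920, 1080),
        "4K screenshot": (3840, 2160),
        "phone photo": (3024, 4032),
        "resized to 1024px": (1024, 576),
    }
    for name, (w, h) in samples.items():
        print(f"{name}: {estimate_image_tokens(w, h)} tokens at low detail")
```

Running the estimate before upload makes it easy to log or reject images that would exceed a per-request token budget.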
Journey Context:
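This report models image cost as a grid of 512x512 pixel tiles, the scheme OpenAI documents for GPT-4o. Claude 3 is not tile-based \(Anthropic's documentation estimates image tokens as roughly width \* height / 750\), so the figures below apply to GPT-4o-style pricing only. Under this model, a standard 1920x1080 screenshot needs 12 tiles \(4 wide x 3 high, since ceil\(1080/512\) = 3\), consuming 2,040 tokens at 170 per tile or 3,060 at 255 per tile. A 4K screenshot \(3840x2160\) needs 40 tiles \(8 x 5\), or 6,800 tokens at 170 per tile. Developers who budget images as 'a few hundred tokens', like a text paragraph, hit 10-50x cost surprises. The trap is sending unprocessed user uploads \(a 3024x4032 phone photo = 6 x 8 = 48 tiles = 8,160 tokens at 170 per tile\) when the model effectively downsamples to 1024px for most vision tasks. The fix is aggressive preprocessing: resize to 1024px on the longest side \(4 tiles max\) or to 512px \(1 tile\) for icon/UI analysis, use low\_detail mode unless reading small text, and default to GPT-4o-mini \($0.15/1M input vs $2.50/1M for 4o\) for vision classification tasks. Two caveats: the 170 and 255 per-tile rates are this report's figures, and OpenAI's published formula differs \(a flat 85 tokens at low detail; 85 plus 170 per tile at high detail, computed after server-side downscaling\); and a lower per-token price only helps if the cheaper model counts image tokens the same way, so confirm both against current provider documentation.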
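The preprocessing step can be sketched as a pure dimension calculation, assuming the report's 512px tile grid. The helper names `fit_longest_side` and `tile_count` are hypothetical; the actual pixel resize would be done separately with an image library, which is left out here so the sketch has no dependencies.

```python
from typing import Tuple


def fit_longest_side(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Return dimensions scaled down so the longest side is at most max_side.

    Aspect ratio is preserved and images are never upscaled.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_w = max(1, round(width * max_side / longest))
    new_h = max(1, round(height * max_side / longest))
    return new_w, new_h


def tile_count(width: int, height: int, tile_px: int = 512) -> int:
    """Count tiles on the 512px grid assumed by this report."""
    return -(-width // tile_px) * -(-height // tile_px)


if __name__ == "__main__":
    original = (3024, 4032)  # the phone photo from the report
    for target in (1024, 512):
        resized = fit_longest_side(*original, target)
        print(
            f"{original} -> {resized} at max {target}px: "
            f"{tile_count(*original)} tiles -> {tile_count(*resized)} tiles"
        )
```

Computing the target size first also lets a pipeline skip the resize entirely when an upload is already within the limit.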
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:23:22.213232+00:00— report_created — created