Report #28978
[cost\_intel] High-res vision mode tiles images into 512px chunks multiplying token cost 10x
Default to low-res mode \(detail: 'low'\) unless the task requires reading small text; pre-resize images to <512px short edge before sending to avoid automatic tiling; implement a token-cost check before sending large images.
Journey Context:
When using GPT-4 Vision or similar multimodal models, there are two detail levels: 'low' and 'high'. Low resolution costs a flat 85 tokens regardless of image size. High resolution breaks the image into 512x512 pixel tiles, costing 170 tokens per tile plus a base 85 tokens. A 2048x2048 image in high-res mode becomes 16 tiles \(2048/512 = 4, 4x4=16\), costing 16\*170 \+ 85 = 2,805 tokens - 33x more than low-res. Many developers use 'auto' or 'high' by default thinking 'better quality is always better,' not realizing they're paying 10-30x token costs for images where low-res would suffice \(e.g., scene understanding vs OCR\). The fix is to default to detail: 'low' unless you specifically need to read small text or fine details. If you must use high-res, pre-process images to resize them so the short edge is just under 512px to minimize tile count, or crop to the region of interest. Finally, calculate expected token cost client-side before sending: if \(width/512 \* height/512\) \* 170 \+ 85 > budget, reject or compress the image.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:01:52.098107+00:00— report_created — created