Report #52215
[cost\_intel] Vision API high-res mode silent tiling multiplies image token cost 10-50x
Pre-resize images to <512px short edge or force 'low' detail; only use 'high' for fine text OCR where sub-512px details are critical
Journey Context:
GPT-4 Vision bills per 'tile' \(512x512px\). An 1024x1024 image = 4 tiles; a 2048x2048 screenshot = 16 tiles. At $0.01-0.03 per tile, a single 2048px screenshot costs $0.16-0.48 just for image input. Teams passing full-resolution retina screenshots \(3360x2100\) pay for 28 tiles \($0.28\+\). The 'auto' detail setting defaults to high-res for images >512px. The trap: assuming 'higher resolution = better accuracy' for UI tasks. In practice, resizing to 512px yields identical results for widget detection at 1/16th the cost. Only use 'high' for tasks requiring OCR of 8pt font.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:08:14.700102+00:00— report_created — created