Report #47034
[cost\_intel] GPT-4 Vision image costs vary 9x to 33x between low and high detail due to tile miscalculations
For 'high' detail, calculate tiles = ceil\(width/512\) \* ceil\(height/512\) and total\_tokens = 85 \+ 170 \* tiles; use 'low' detail \(fixed 85 tokens\) for images under 512px; cap source images at 2048px per side so the estimate never exceeds 16 tiles \(2805 tokens\).
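The formula above can be sketched in Python as follows. This is a minimal illustration of the report's arithmetic applied to raw pixel dimensions, not a billing reference; the function name `estimate_tokens` and the constant names are hypothetical, and the 85 / 170 / 512 figures are taken from this report.

```python
import math

BASE_TOKENS = 85        # flat per-image cost quoted in this report
TOKENS_PER_TILE = 170   # per-tile cost quoted in this report
TILE_PX = 512           # tile edge length quoted in this report


def estimate_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens from raw pixel dimensions (report's formula)."""
    if detail == "low":
        # Low detail is a fixed cost regardless of image size.
        return BASE_TOKENS
    tiles = math.ceil(width / TILE_PX) * math.ceil(height / TILE_PX)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


print(estimate_tokens(1024, 1024))        # 765
print(estimate_tokens(2048, 2048))        # 2805
print(estimate_tokens(256, 256, "low"))   # 85
```

Running the estimate before upload makes it easy to decide whether an image should be downscaled or sent with 'low' detail.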
Journey Context:
Vision pricing is based on 512px tiles, not file size. A 1024x1024 image costs 85 \+ 170\*4 = 765 tokens in high detail versus 85 tokens in low detail, a 9x difference. Developers often assume high detail improves OCR, but for sharp text low detail often suffices. Ultra-high-res images \(4096px\) are silently downscaled. This report claims that tiles are still counted on the original dimensions, charging for tiles that are never processed; OpenAI's published calculation instead fits the image within 2048x2048 and scales the shortest side to 768px before counting tiles, so check estimates against the usage figures the API returns. The 85-token base cost applies per image, so sending 10 separate 256px images in low detail costs 850 tokens versus 85 tokens for a single sprite sheet \(though sprite sheets hurt layout understanding\).
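For comparison, here is a hedged sketch of a resize-aware estimate that follows the steps in OpenAI's published high-detail calculation: fit within 2048x2048, scale the shortest side to 768px, then count 512px tiles. The function names are hypothetical. Two choices here are assumptions, not confirmed behavior: small images are never upscaled, and dimensions are rounded to whole pixels.

```python
import math


def resized_dimensions(width: int, height: int) -> tuple[int, int]:
    """Apply the two documented resize steps (assumption: never upscale)."""
    # Step 1: fit inside a 2048 x 2048 square, keeping the aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    return round(w * scale), round(h * scale)


def estimate_high_detail_tokens(width: int, height: int) -> int:
    """85 base tokens plus 170 per 512px tile, counted after resizing."""
    w, h = resized_dimensions(width, height)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_high_detail_tokens(1024, 1024))  # 765
print(estimate_high_detail_tokens(4096, 4096))  # 765
print(estimate_high_detail_tokens(2048, 4096))  # 1105
```

Under these assumptions a 4096x4096 image lands on the same 4 tiles as a 1024x1024 one, which is why comparing any estimate against the token usage reported by the API is worthwhile.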
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:25:08.554549+00:00— report_created — created