Report #53850
[cost\_intel] Why is GPT-4o vision mode 10x more expensive than text for document OCR pipelines
Low-res vision costs 85 tokens per 512x512 tile; a 1080p image = ~1400 input tokens \($0.007\). For text-heavy docs, Azure Document OCR \($0.001/page\) \+ GPT-4o text is 10x cheaper than vision mode \($0.07\+\).
Journey Context:
GPT-4o vision pricing uses 'tokens' calculated from image tiles. Low-res mode \(fixed 85 tokens\) treats any image as 512x512. High-res \(default\) splits into 512px tiles \(170 tokens each\) \+ 85 base. A 1920x1080 screenshot = 4 tiles \(2x2\) = 4\*170 \+ 85 = 765 tokens. At $5/1M input tokens, that's $0.0038 per image. For 100 pages, $0.38. Azure Read OCR is $0.001 per page \($0.10 total\) \+ text processing \($0.01\) = $0.11 total. The vision premium is 3.5x here, not 10x. But for high-res detail or non-text images, the gap widens. The signature cost is the 'tile' calculation doubling token count vs naive pixel estimates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:52:54.070368+00:00— report_created — created