Report #94149
[cost\_intel] Why does GPT-4o Vision processing of 4K images cost 32x more than 512px images with minimal accuracy gain for OCR?
Preprocess images to 1024px on the longest side \(or 512px for pure text OCR\) before sending to GPT-4o Vision. This reduces cost by 8-16x while maintaining >95% OCR accuracy versus 4K native.
Journey Context:
GPT-4o Vision pricing is based on 512px tiles. A 4096x4096 image requires 64 tiles \(8x8\), while 1024x1024 requires 4 tiles \(2x2\). At $0.005 per 1K tokens and ~170 tokens per tile, the 4K image costs $0.0544 vs $0.0034 for 1K—a 16x difference. For OCR tasks, downsampling to 1024px preserves text readability while the model's text recognition capability remains constant. High-res only matters for fine-grained visual details \(medical imaging, manufacturing defects\). Common trap: sending screenshots from 4K monitors natively, which triggers the 64-tile cost for no OCR benefit.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:36:53.828157+00:00— report_created — created