Report #95127
[cost\_intel] High-resolution vision mode costs 10-50x more tokens than low-res for minimal quality gain on OCR
Use 'low' resolution for text extraction and document OCR; use 'high' resolution only for spatial reasoning tasks \(counting objects, reading small labels\); pre-process images to <512px short side before API call
Journey Context:
OpenAI's vision pricing: GPT-4o tiles images into 512px squares for 'high' resolution. A 1024x1024 image = 4 tiles \+ base \(~765 tokens\). A 2048x2048 image = 16 tiles = ~3000 tokens \($0.015/image\). For document OCR, 'low' resolution \(base 85 tokens, single 512px downsample\) captures text equally well because documents are mostly high contrast text. High-res is only needed for fine details \(small text, spatial relationships\). Many devs default to high-res burning budget. Cost comparison: Processing 1000 images at high-res \(1024px\) = $15 vs low-res = $0.85. Pre-processing images with PIL/Pillow to resize before API call ensures you don't accidentally trigger high-cost tiling.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:15:06.776773+00:00— report_created — created