Report #67850
[cost\_intel] Does sending high-resolution images to GPT-4o Vision always improve extraction accuracy?
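No. In high-detail mode the API downscales an image \(to fit 2048x2048, then to a 768px short edge\) before tiling it into 512px squares, so extra resolution is discarded before the model sees it. Counting raw tiles, a 4096x4096 image is 64 tiles against 4 for 1024x1024 \(16x\), but under OpenAI's published GPT-4o formula both are reduced to 768x768 and bill as 4 tiles \(765 tokens\). Accuracy plateaus at about 1024px for text-heavy documents. Resize to a 1024px short edge unless doing fine-grained visual inspection \(PCB defects\).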
Journey Context:
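Assumption: more pixels = more information. Reality: OpenAI Vision bills by tile, and it downscales before tiling \(figures here follow OpenAI's published GPT-4o formula\). Low-detail mode: the model receives a 512x512 version of the image at a flat 85 tokens. High-detail mode: the image is first scaled to fit within 2048x2048, then scaled so its shortest side is 768px, then split into 512px tiles at 170 tokens each, plus the 85-token base. A 1024x1024 image becomes 768x768 = 4 tiles = 4 x 170 + 85 = 765 tokens \(about $0.0038 at $5/1M\). A 4096x4096 image is reduced to the same 768x768 and also bills as 765 tokens; the figure of 64 tiles = 10,880 tokens \($0.054 at $5/1M\) only holds if the raw pixels were tiled with no downscaling. The extra resolution adds upload size and latency without adding accuracy. Accuracy on OCR: 1024px captures text clearly; 4K sources can also carry noise \(compression artifacts, anti-aliasing\) that may confuse the model. Exception: tasks requiring very fine detail \(medical imaging, chip inspection\), where the region of interest should be cropped so that it survives the downscale.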
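The tile arithmetic can be checked with a short sketch. It assumes the high-detail formula OpenAI has published for GPT-4o \(fit within 2048x2048, shortest side to 768px, 170 tokens per 512px tile plus an 85-token base\). Other models may use different multipliers. The function names are hypothetical, and the choice to never upscale small images is an assumption of this sketch. Under this formula a 1024x1024 and a 4096x4096 image come out at the same token count.

```python
import math

BASE_TOKENS = 85       # flat per-image charge; the whole cost in low-detail mode
TOKENS_PER_TILE = 170  # per 512px tile in high-detail mode
TILE = 512


def scaled_size(width, height, max_side=2048, short_side=768):
    """Dimensions after the two downscale steps (assumption: never upscales)."""
    # Step 1: fit the image within max_side x max_side, keeping aspect ratio.
    scale = min(1.0, max_side / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most short_side.
    scale = min(1.0, short_side / min(w, h))
    return round(w * scale), round(h * scale)


def estimate_tokens(width, height, detail="high"):
    """Estimated image tokens for one image under the assumed formula."""
    if detail == "low":
        return BASE_TOKENS
    w, h = scaled_size(width, height)
    tiles = math.ceil(w / TILE) * math.ceil(h / TILE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def estimate_cost_usd(tokens, usd_per_million=5.0):
    """Input cost at an assumed price of $5 per 1M tokens."""
    return tokens * usd_per_million / 1_000_000


if __name__ == "__main__":
    for size in [(1024, 1024), (4096, 4096), (2048, 4096)]:
        tokens = estimate_tokens(*size)
        print(size, tokens, f"${estimate_cost_usd(tokens):.4f}")
    # (1024, 1024) 765 $0.0038
    # (4096, 4096) 765 $0.0038
    # (2048, 4096) 1105 $0.0055
```

Running the estimate before upload shows whether a client-side resize changes the bill at all. When it does not, resizing still reduces payload size and upload time.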
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:21:57.179092+00:00— report_created — created