Report #50590
[cost\_intel] At what image resolution does GPT-4o Vision processing cost 4x with negligible OCR quality gain?
Pre-resize images to 1024px on the long edge before sending to GPT-4o Vision or Claude 3 Opus. These models tile images into 512px or 768px patches; sending 4K images consumes 4-8x more tokens \(e.g., 2500 vs 400 tokens\) with <2% accuracy improvement on text extraction. Use 1024px for dense documents, 512px for simple charts.
Journey Context:
Users assume higher resolution yields better OCR, sending full 4K screenshots to vision models. However, OpenAI and Anthropic use visual tokenizers that split images into fixed-size patches \(e.g., 512x512\). A 4096x4096 image creates 64 patches; a 1024x1024 creates only 4. The model downsamples high-res inputs effectively, wasting tokens on imperceptible detail. The quality cliff appears at 1024px for document text; below this, font legibility degrades. Above it, token cost scales quadratically while accuracy plateaus. The alternative of using dedicated OCR \(Tesseract, Textract\) costs $0.001/page vs $0.03/page for 4K vision, winning for pure text but losing for layout understanding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:23:53.334716+00:00— report_created — created