Report #51490
[cost_intel] High-resolution image inputs consuming 1000x more tokens than text while producing only marginal quality improvements
Pre-resize images to 512px shortest side before API submission; use low-detail mode for document OCR tasks; implement dynamic resolution selection based on image content type (charts need high-res, portraits don't)
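The dynamic resolution selection in the recommendation above might be sketched as follows. This is a minimal sketch, not a verified workaround: the content-type labels, every pixel target other than the 512px default, and the function name `target_size` are illustrative assumptions and do not come from this report or from any provider's documentation.

```python
# Hypothetical policy table: shortest-side caps per content type.
# Only the 512px default comes from the report; the rest are assumptions.
SHORTEST_SIDE_BY_TYPE = {
    "chart": 1024,    # small labels and thin lines need more pixels
    "photo": 512,
    "portrait": 512,
    "icon": 512,
}
DEFAULT_SHORTEST_SIDE = 512


def target_size(width, height, content_type="photo"):
    """Return (new_width, new_height) with the shortest side capped.

    Preserves aspect ratio and never upscales an image that is
    already at or below the cap.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    cap = SHORTEST_SIDE_BY_TYPE.get(content_type, DEFAULT_SHORTEST_SIDE)
    shortest = min(width, height)
    if shortest <= cap:
        return width, height
    scale = cap / shortest
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # A 4K screenshot treated as a photo versus as a chart.
    print(target_size(3840, 2160, "photo"))
    print(target_size(3840, 2160, "chart"))
```

The returned tuple can be passed to whatever imaging library the pipeline already uses (for example, Pillow's `Image.resize`) before the image is encoded and submitted.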
Journey Context:
GPT-4o and Claude 3 charge for image inputs per 'tile' or per pixel block rather than per file. On OpenAI's high-detail path, a 4K image (3840x2160) is downscaled and then chunked into 512x512 tiles at 170 tokens per tile plus an 85-token base; Claude counts tokens roughly in proportion to pixel count and tops out near 1,600 tokens for a single high-resolution image. Either way, one image costs on the order of 1,000 or more text tokens, and a request with a few images quickly passes 3,000. The trap: users think 'I'll just pass the screenshot' without realizing that a screenshot from a 4K monitor can cost far more than the text response it produces (this report's estimate: $0.15 per image versus $0.002 for the response). Worse: many RAG systems extract images from PDFs at full resolution and then embed them, burning budget on blurry document scans that should have been OCR'd to text first. The quality trap: high resolution is necessary for small text and fine detail (medical scans, engineering diagrams) but wasted on photographs or icons. The fix requires preprocessing pipelines that resize based on information density, not display resolution.
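The tile arithmetic can be made concrete with a small estimator. This is a sketch under stated assumptions: it follows the high-detail rules OpenAI has published for GPT-4o (fit within 2048x2048, scale the shortest side down to 768px, then 170 tokens per 512px tile plus an 85-token base, or a flat 85 tokens in low-detail mode). Other models and providers count differently, the function name `openai_image_tokens` is hypothetical, and the no-upscaling behavior for small images is an assumption.

```python
import math


def openai_image_tokens(width, height, detail="high"):
    """Estimate input tokens for one image under GPT-4o-style tiling."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail == "low":
        return 85  # flat cost, independent of resolution

    # Step 1: fit inside a 2048x2048 square (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: bring the shortest side down to 768px (assumed downscale only).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512px tiles and add the base cost.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


if __name__ == "__main__":
    print(openai_image_tokens(3840, 2160))         # raw 4K screenshot
    print(openai_image_tokens(910, 512))           # pre-resized to 512px shortest side
    print(openai_image_tokens(3840, 2160, "low"))  # low-detail mode
```

Under these assumptions a raw 4K screenshot comes to 6 tiles (1,105 tokens), while the same image pre-resized to a 512px shortest side comes to 2 tiles (425 tokens), and low-detail mode stays at 85 tokens regardless of input size.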
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:55:02.396614+00:00— report_created — created