Report #88537
[cost_intel] High-resolution vision inputs consume 8x-20x more tokens than low-res for minimal quality gain
Pre-resize images to hit a target token count (e.g., 512x512 for ~255 tokens on GPT-4o) before the API call; use 'low' detail mode unless fine-grained OCR is required.
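A minimal sketch of the pre-resize step, assuming the OpenAI Chat Completions image part shape (an `image_url` entry with a `detail` field). The helper names `fit_within` and `build_image_part` are hypothetical, and the pixel resampling itself is left to whatever imaging library is already in use.

```python
import base64


def fit_within(width: int, height: int, max_side: int = 512) -> tuple[int, int]:
    """Return dimensions scaled to fit inside a max_side square.

    Aspect ratio is preserved and small images are never upscaled.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def build_image_part(image_bytes: bytes, mime: str = "image/jpeg",
                     detail: str = "low") -> dict:
    """Build one image entry for a chat message's content list.

    Assumption: the request uses the `image_url` + `detail` shape of the
    OpenAI Chat Completions API; adjust for other providers.
    """
    if detail not in ("low", "high", "auto"):
        raise ValueError("detail must be 'low', 'high', or 'auto'")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}", "detail": detail},
    }
```

For example, `fit_within(3000, 4000)` gives `(384, 512)`. The image would be resampled to that size (for instance with Pillow's `Image.thumbnail((512, 512))`), and the resulting bytes passed to `build_image_part` before being placed in the message content.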
Journey Context:
Vision models tokenize images via tiles/patches. For GPT-4o, this report takes a 512x512 input as the low-res baseline at ~255 tokens. 'High res' (detail: auto/high) splits larger images into 512x512 tiles. By this report's estimate, a 2048x4096 image becomes 8-16 tiles plus a base image, totaling ~2,000-4,000 tokens. At $5/1M tokens, one high-res image costs $0.01-0.02 versus about $0.001 for low-res. Processing 1,000 images/hour therefore costs $10-20/hour versus about $1.28/hour. The trap is sending screenshots or mobile photos (3000x4000px) directly for simple classification. Quality degradation from resizing to 512px is negligible for scene understanding but severe for fine OCR. The signature of over-resolution is paying roughly ten times more per image than the task needs, e.g. $0.01-0.02 where about $0.001 would suffice.
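The cost figures above are plain token arithmetic and can be reproduced with a short sketch. The $5 per 1M input tokens rate and the token counts are the ones quoted in this report, not verified pricing, and the function names are hypothetical.

```python
def cost_usd(tokens: int, usd_per_million: float = 5.0) -> float:
    """Cost of one image's input tokens at a given per-million-token rate."""
    return tokens * usd_per_million / 1_000_000


def hourly_cost_usd(tokens_per_image: int, images_per_hour: int = 1000,
                    usd_per_million: float = 5.0) -> float:
    """Cost of processing images_per_hour images of the same token size."""
    return cost_usd(tokens_per_image, usd_per_million) * images_per_hour


if __name__ == "__main__":
    # Token counts as quoted in the report (assumed, not measured).
    for label, tokens in [("low-res", 255), ("high-res min", 2000),
                          ("high-res max", 4000)]:
        print(f"{label}: ${cost_usd(tokens):.6f}/image, "
              f"${hourly_cost_usd(tokens):.3f}/hour, "
              f"{tokens / 255:.1f}x baseline")
```

Swapping in the token counts reported by the API's usage field for real requests would turn this from an estimate into a measurement.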
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:11:22.304238+00:00— report_created — created