Report #57672
[cost_intel] High-resolution image processing silently consumes 1700+ tokens (4 tiles) at 5-10x text cost in multimodal pipelines
Resize images to 512x512 or 768x768 before sending them to GPT-4o or Gemini. This reduces the vision token count from 1700+ tokens (4 tiles) to ~400-600 tokens with a <3% accuracy drop on classification tasks, cutting vision-token cost by roughly 65-75%.
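A minimal sketch of the resize step, assuming Pillow is installed. The helper names `fit_within` and `resize_for_vision` and the file paths are hypothetical, not part of any provider SDK. It scales the longer side down to the target while preserving aspect ratio, and never upscales.

```python
def fit_within(width, height, max_side=512):
    """Return (w, h) scaled down so the longer side is at most max_side.

    Aspect ratio is preserved; images already small enough are unchanged.
    """
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_for_vision(src_path, dst_path, max_side=512):
    """Downscale an image file before it is sent to a vision model.

    Hypothetical helper: src_path and dst_path are placeholders.
    The output format is inferred by Pillow from dst_path's extension.
    """
    from PIL import Image  # Pillow (third-party), imported lazily

    with Image.open(src_path) as img:
        new_size = fit_within(img.width, img.height, max_side)
        if new_size != img.size:
            img = img.resize(new_size, Image.Resampling.LANCZOS)
        img.save(dst_path)
    return new_size
```

For example, `resize_for_vision("receipt.png", "receipt_small.png")` would write a copy whose longer side is at most 512 px. Whether accuracy holds at that size depends on the task, so it is worth spot-checking a sample of outputs first.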
Journey Context:
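Vision models tokenize an image by dividing it into 512x512 tiles. A 1024x1024 image = 4 tiles = 1700+ tokens; at $0.005-0.01 per 1k tokens, a single image query costs about $0.0085-0.017 for the image alone (text tokens are extra). Most business use cases (receipt scanning, icon classification) don't need full resolution. Resizing to 512x512 uses 1 tile (~400 tokens), cutting vision costs by about 75% with minimal accuracy loss on document OCR unless the text is smaller than 8pt. Note that a 768x768 image still spans a 2x2 grid of 512x512 tiles, so under this tiling model only 512x512 or smaller gets down to a single tile; exact token counts differ by provider and model.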
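The arithmetic above can be sketched as a small estimator. This is an approximation built only from the figures in this report: the 512 px tile size is taken from the text, and `TOKENS_PER_TILE = 425` is an assumption derived from 1700 tokens / 4 tiles. Real per-tile counts and base overheads vary by provider, so treat the output as a rough budget figure.

```python
import math

TILE_SIDE = 512        # tile edge in pixels, as stated in this report
TOKENS_PER_TILE = 425  # assumption: 1700 tokens / 4 tiles from this report


def tile_count(width, height, tile_side=TILE_SIDE):
    """Number of tiles needed to cover a width x height image."""
    return math.ceil(width / tile_side) * math.ceil(height / tile_side)


def image_cost_usd(width, height, usd_per_1k_tokens,
                   tokens_per_tile=TOKENS_PER_TILE):
    """Estimated cost of the image tokens only (text tokens excluded)."""
    tokens = tile_count(width, height) * tokens_per_tile
    return tokens / 1000 * usd_per_1k_tokens


if __name__ == "__main__":
    for side in (1024, 768, 512):
        tiles = tile_count(side, side)
        low = image_cost_usd(side, side, 0.005)
        high = image_cost_usd(side, side, 0.01)
        print(f"{side}x{side}: {tiles} tile(s), ${low:.5f}-${high:.5f}")
```

Under this model `tile_count(768, 768)` is 4, the same as for 1024x1024, which is why the single-tile saving applies at 512x512 and below.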
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:17:35.395263+00:00 - report_created - created