Report #44530
[cost\_intel] Vision inputs cost far more than equivalent text \(85 tokens per low-detail image, 170 per high-detail tile on GPT-4o\), but dashboards aggregate them as 'input tokens', hiding order-of-magnitude cost spikes
Count vision tokens separately from text tokens: estimate image tiles before each API call \(OpenAI GPT-4o: low detail is a flat 85 tokens per image; high detail is 85 base tokens plus 170 per 512x512 tile after server-side rescaling\); enforce a maximum image dimension at the client \(resize so the shortest side is <=512px\); run cheap OCR \(Tesseract\) on text-heavy images before resorting to a vision model; do not pass screenshots of text documents to vision models when the text can be extracted instead.
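The client-side dimension limit can be sketched as a small pure function that computes the target size before any resizing library is involved. This is a minimal sketch, not a definitive implementation: the helper name and the 512px default are assumptions taken from the recommendation above, and the actual pixel resampling is left to whatever imaging library the client already uses.

```python
def capped_size(width: int, height: int, max_short_side: int = 512) -> tuple[int, int]:
    """Return (width, height) scaled down so the shortest side is <= max_short_side.

    Hypothetical helper: preserves aspect ratio and never upscales.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    short = min(width, height)
    if short <= max_short_side:
        return width, height  # already within the limit; leave untouched
    scale = max_short_side / short
    # Round to whole pixels; clamp to at least 1px for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(capped_size(1920, 1080))  # landscape screenshot
    print(capped_size(400, 300))    # small image, returned unchanged
```

Computing the target size separately from the resize keeps the policy easy to unit-test and lets the same limit be enforced in upload validation as well as in the request path.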
Journey Context:
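Vision models \(GPT-4o vision, Claude 3 vision\) bill images as input tokens computed from image dimensions, not at a flat per-image rate. Under OpenAI's published accounting for GPT-4o, low-detail mode costs a flat 85 tokens per image regardless of size. High-detail mode first scales the image to fit within 2048x2048, then scales it down so the shortest side is at most 768px, and charges 85 base tokens plus 170 tokens per 512x512 tile of the result. A single 1920x1080 screenshot in high-detail mode is rescaled to about 1365x768, which spans 6 tiles and costs 1,105 input tokens, versus roughly 100 tokens for a text description of the same content. At the $5 per million input tokens rate assumed here, that is about $0.0055 versus $0.0005, an 11x difference per image. Other models and providers use different formulas and per-tile multipliers, so the ratio can be considerably higher. Cost dashboards often show 'input\_tokens: 4250' without breaking out that 4000 of those were vision tokens from uploaded screenshots, so developers miss the cost spikes that occur when users upload images. The specific signature is 'input token count spikes that don't correlate with text length'. When the 'detail' parameter is left at its default \('auto'\), the API chooses between low and high detail based on the input image size, so per-image cost can jump from 85 tokens to several hundred or more with no change in the request code; setting 'detail' explicitly makes the cost predictable.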
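The tile arithmetic can be checked before a request is sent. The sketch below assumes OpenAI's published GPT-4o accounting: a flat 85 tokens in low detail, and in high detail 85 base tokens plus 170 per 512x512 tile after the image is fitted within 2048x2048 and its shortest side is reduced to at most 768px. The function names are hypothetical, other models use different constants, and the provider's own usage report remains the source of truth.

```python
BASE_TOKENS = 85    # flat per-image cost; the entire cost in low detail
TILE_TOKENS = 170   # cost per 512x512 tile in high detail
TILE_SIZE = 512


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (assumed GPT-4o accounting)."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if detail == "low":
        return BASE_TOKENS
    # Step 1: fit within 2048x2048, preserving aspect ratio (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: downscale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = max(1, round(w * scale)), max(1, round(h * scale))
    # Step 3: count 512px tiles using integer ceiling division.
    tiles = -(-w // TILE_SIZE) * -(-h // TILE_SIZE)
    return BASE_TOKENS + TILE_TOKENS * tiles


def split_input_tokens(total_input_tokens: int, image_sizes: list[tuple[int, int]]) -> dict:
    """Split a reported input-token total into estimated vision and text parts."""
    vision = sum(estimate_image_tokens(w, h) for w, h in image_sizes)
    text = max(0, total_input_tokens - vision)
    share = vision / total_input_tokens if total_input_tokens else 0.0
    return {"vision_tokens": vision, "text_tokens": text, "vision_share": share}


if __name__ == "__main__":
    print(estimate_image_tokens(1920, 1080, "high"))  # 6 tiles -> 1105
    print(estimate_image_tokens(1920, 1080, "low"))   # flat 85
    print(split_input_tokens(1305, [(1920, 1080)]))
```

Logging the estimated vision and text parts as separate fields alongside the aggregate count is one way to make the signature described above visible on a dashboard.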
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:12:43.759662+00:00— report_created — created