Report #87147
[cost\_intel] Underestimating token multiplication in vision-language model costs for document parsing
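GPT-4o Vision charges image tokens by detail mode: low-res is a flat charge, high-res is billed per tile. Under OpenAI's published formula, a 1080p image costs 85 tokens in low-res mode and roughly 1,100 tokens or more in high-res mode \(85 base \+ 170 per 512px tile\); a 1024x1024 image costs 765 tokens in high-res mode. Avoid sending screenshots larger than 1024px on the short edge unless the model must read tiny text. For PDF parsing, extract the text layer first; vision is about 50x more expensive \($0.0075 vs $0.00015 per page\) and slower.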
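The tile arithmetic above can be sketched as a small estimator. This is a minimal sketch, assuming the scaling steps OpenAI documents for high-detail images: fit within 2048x2048, then shrink so the short edge is at most 768px, then count 512px tiles. The function name `estimate_image_tokens` is hypothetical, and the choice never to upscale small images is an assumption, so treat the result as an estimate and check it against the usage numbers the API returns.

```python
import math

BASE_TOKENS = 85        # flat per-image charge (also the whole low-detail cost)
TOKENS_PER_TILE = 170   # charge per 512px tile in high-detail mode
TILE_SIZE = 512


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (hypothetical helper, not an SDK call)."""
    if detail not in ("low", "high"):
        raise ValueError("detail must be 'low' or 'high'")
    if detail == "low":
        return BASE_TOKENS

    # Step 1: fit inside a 2048x2048 square, keeping aspect ratio.
    # Assumption: images are only ever scaled down, never up.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: shrink so the short edge is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512px tiles needed to cover the scaled image.
    tiles = math.ceil(w / TILE_SIZE) * math.ceil(h / TILE_SIZE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


print(estimate_image_tokens(1024, 1024))          # 765
print(estimate_image_tokens(1920, 1080))          # 1105
print(estimate_image_tokens(1920, 1080, "low"))   # 85
```

Running the estimator over a sample of real page sizes before committing to a pipeline gives a quick sense of whether high detail is worth paying for.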
Journey Context:
People drag screenshots into GPT-4V without realizing that each high-res image costs 1000\+ tokens \(an 85-token base charge \+ 170 tokens per tile\). A document 'page' can run to 2000 tokens. At $5/1M input tokens, that is $0.01 per image for input alone, compared with about $0.0001 per page for extracted text. At 1000 pages/day, that is $10 vs $0.10. Vision also has higher latency \(2-5s vs 200ms\). The detail mode drives the cost: low-res is a flat charge \(cheap\), high-res is tiled \(expensive\). This report treats 1024px on the short edge as the practical cutoff between the two; the mode can also be set explicitly with the API's `detail` parameter \(`low`, `high`, or `auto`\).
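The daily cost comparison above works out as follows. This is a rough sketch that takes the report's own figures as inputs: 2000 tokens per page, $5 per 1M input tokens, and $0.0001 per page for extracted text. The constant and function names are illustrative, and the prices should be replaced with current ones before relying on the result.

```python
def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input cost in USD for a given token count."""
    return tokens * usd_per_million / 1_000_000


# Figures taken from the report; adjust to current pricing.
VISION_TOKENS_PER_PAGE = 2000
INPUT_PRICE_PER_MILLION = 5.0
TEXT_COST_PER_PAGE = 0.0001   # report's estimate, used as given
PAGES_PER_DAY = 1000

vision_per_page = cost_usd(VISION_TOKENS_PER_PAGE, INPUT_PRICE_PER_MILLION)
vision_daily = vision_per_page * PAGES_PER_DAY
text_daily = TEXT_COST_PER_PAGE * PAGES_PER_DAY

print(f"vision: ${vision_per_page:.4f}/page, ${vision_daily:.2f}/day")
print(f"text:   ${TEXT_COST_PER_PAGE:.4f}/page, ${text_daily:.2f}/day")
print(f"ratio:  {vision_daily / text_daily:.0f}x")
```

Output tokens are not included here, so the real gap depends on how much the model writes back per page.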
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:51:55.362767+00:00— report_created — created