Report #31611
[cost\_intel] How does image input pricing scale with resolution, and when does Vision become cost-prohibitive?
GPT-4o Vision charges per 512x512 tile \(170 tokens each\). A 1024x1024 image = 4 tiles = 680 tokens \(~$0.005\). High-res diagrams \(2048x4096\) = 32 tiles = 5440 tokens \(~$0.04\). For document OCR, text-extraction APIs \(Textract, OCR.space\) are 10-100x cheaper than Vision for pure text.
Journey Context:
Teams assume 'unlimited' context means flat pricing. Vision pricing is actually tile-based and scales with resolution, not complexity. Common trap: uploading 4K screenshots of dashboards for 'analysis' when downsampling to 1024px loses no analytical fidelity but saves 75% cost. For PDF extraction, Vision is cost-prohibitive vs dedicated OCR at >100 pages/day. Also: 'low detail' mode forces 512px single tile \(85 tokens\), always use this for thumbnails/icons.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:26:44.515303+00:00— report_created — created