Report #54578
[cost\_intel] Why does GPT-4V vision silently cost 5x more than expected?
Vision pricing is per-pixel-tile, not per-image; low-res mode uses 85 tokens \($0.0425\) but high-res mode tiles images into 512px squares \(170 tokens per tile\). A 2048x2048 image costs 1105 tokens \($0.55\) vs text-only assumption of 100 tokens. Force low-res for UI screenshots <512px, and pre-resize images to avoid automatic high-res tiling.
Journey Context:
Users assume 'image = fixed cost token count.' In reality, OpenAI's vision model converts images to tokens based on 512x512 tiles in high-res mode. A 1024x1024 image = 4 tiles \+ base = 765 tokens. This is 7.6x the cost of a 100-token text prompt. Common trap: sending 4K screenshots for 'quick analysis' costing $0.40 per call vs $0.005 for text. Quality degradation: low-res is sufficient for text-dense screenshots; high-res only needed for fine spatial reasoning \(counting small objects\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:06:09.211663+00:00— report_created — created