Report #71717
[cost\_intel] Why do multimodal inputs with images cost 10-50x more than expected?
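For GPT-4o in high-detail mode, estimate image tokens as 85 base tokens \+ 170 \* ceil\(width/512\) \* ceil\(height/512\), using the dimensions after the API's downscaling \(fit within 2048x2048, then shortest side reduced to 768px\). A 512x512 image is one tile and costs 255 tokens \(~$0.0013 at the $5 per 1M input tokens these figures assume\); a 2048x2048 image is downscaled to 768x768, which is 4 tiles and 765 tokens \(~$0.0038\). In 'low' detail mode the same image costs a flat 85 tokens, so high-detail images cost roughly 3-13x more each in these examples, and a few high-detail screenshots resent over several turns can exceed 10,000 tokens if you don't cap resolution or use 'low' detail mode.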
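The tile arithmetic can be sketched as a small estimator. This is a minimal sketch, not an official calculator: \`estimate_image_tokens\` is a hypothetical helper, and it assumes that dimensions are rounded to whole pixels after each scaling step and that small images are never upscaled. Always compare the result against the \`usage\` field of a real response.

```python
import math

BASE_TOKENS = 85    # flat charge added to every image
TILE_TOKENS = 170   # charge per 512x512 tile in high-detail mode
TILE_SIZE = 512


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o input tokens for one image (hypothetical helper)."""
    if detail == "low":
        # Low detail is a flat charge regardless of image size.
        return BASE_TOKENS

    w, h = width, height

    # Step 1: fit within a 2048x2048 square, preserving aspect ratio.
    longest = max(w, h)
    if longest > 2048:
        scale = 2048 / longest
        # Assumption: dimensions are rounded to whole pixels.
        w, h = round(w * scale), round(h * scale)

    # Step 2: shrink so the shortest side is at most 768 px.
    # Assumption: images already smaller than this are not upscaled.
    shortest = min(w, h)
    if shortest > 768:
        scale = 768 / shortest
        w, h = round(w * scale), round(h * scale)

    # Step 3: count 512x512 tiles and add the base charge.
    tiles = math.ceil(w / TILE_SIZE) * math.ceil(h / TILE_SIZE)
    return BASE_TOKENS + TILE_TOKENS * tiles


if __name__ == "__main__":
    for size in [(512, 512), (2048, 2048), (3840, 2160)]:
        print(size, estimate_image_tokens(*size))
```

Under these assumptions a 3840x2160 screenshot comes out at 6 tiles, while the same call with \`detail="low"\` returns the flat 85 tokens.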
Journey Context:
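Developers often assume 'one image = one token' or underestimate the tile-based calculation. With \`detail: low\`, GPT-4o charges a flat 85 tokens per image regardless of its size. In high-detail mode the image is first scaled to fit within 2048x2048, then scaled down so its shortest side is 768px, and the result is split into 512x512 tiles at 170 tokens per tile plus 85 base tokens. A screenshot from a 4K monitor \(3840x2160\) is scaled to 2048x1152 and then to roughly 1365x768, giving 3x2=6 tiles and 6\*170 \+ 85 = 1,105 tokens just for the image, about 13x the low-detail cost. Tiling the raw 3840x2160 pixels instead would suggest 8x5=40 tiles and 6,885 tokens, so estimates that skip the downscaling step overstate the cost several times over. If the user then asks follow-up questions, the image tokens are billed again on every turn because the conversation history is resent, unless prompt caching applies. The 'silent 10x cost' comes from sending high-detail screenshots when low detail suffices for the task \(e.g., 'is this a cat?' doesn't need 4K\). Mitigation: request 'low' detail where the task allows, pre-resize images to 768px on the longest side before the API call \(a 4K screenshot becomes 768x432, which is 2 tiles and 425 tokens\), and always check the \`usage\` field in responses to audit token counts.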
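The pre-resize mitigation and the per-turn re-billing can be sketched as follows. Both functions are hypothetical helpers, not part of any SDK: \`fit_longest_side\` only computes the target dimensions \(the actual resampling would be done with an image library of your choice\), and \`rebilled_image_tokens\` assumes the image is resent on every turn with no caching discount.

```python
def fit_longest_side(width: int, height: int, max_side: int = 768) -> tuple[int, int]:
    """Return dimensions scaled down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    # Round to whole pixels and keep each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))


def rebilled_image_tokens(tokens_per_image: int, turns: int) -> int:
    """Total image tokens billed when the image is resent on every turn.

    Assumption: no prompt caching, so each turn pays the full image cost.
    """
    return tokens_per_image * turns


if __name__ == "__main__":
    # Target size for a 4K screenshot before upload.
    print(fit_longest_side(3840, 2160))
    # A 1,105-token image carried through a 10-turn conversation.
    print(rebilled_image_tokens(1105, 10))
```

Resizing before upload keeps the tile count predictable on the client side instead of depending on the API's own downscaling, which makes the numbers in the \`usage\` field easier to reconcile.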
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:57:43.643733+00:00— report_created — created