Report #100840
[cost_intel] Why did my multimodal API bill suddenly jump 10x when I added image inputs?
Image inputs are billed as tokens, and the 'high' detail setting can cost roughly an order of magnitude more than 'low'. On GPT-4o, a 1024x1024 image at high detail costs 765 tokens (~$0.0019), about 9x the fixed 85 tokens charged at 'low'. On GPT-4.1-mini/nano, every 32x32 patch is a token, capped at 1536 patches per image. Default to 'low' for classification, color, shape, and presence detection; reserve 'high' for tasks that genuinely require reading small text or fine visual detail.
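The GPT-4o numbers above can be reproduced with a short sketch. It assumes the commonly documented tile rule (fit the image inside 2048x2048, shrink so the shortest side is at most 768px, then charge 170 tokens per 512px tile plus an 85-token base) and assumes small images are not upscaled. The function name `gpt4o_image_tokens` and the price constant are illustrative, not part of any SDK; the $2.50 per million input tokens is inferred from the ~$0.0019 figure and should be checked against current pricing.

```python
import math

# Assumed input price in USD per 1M tokens (hypothetical; verify before use).
PRICE_PER_MILLION = 2.50

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image tokens under the tile rule described above."""
    if detail == "low":
        return 85  # flat cost regardless of image size

    w, h = width, height
    # Step 1: fit inside a 2048x2048 square, preserving aspect ratio.
    longest = max(w, h)
    if longest > 2048:
        scale = 2048 / longest
        w, h = round(w * scale), round(h * scale)
    # Step 2: shrink so the shortest side is at most 768px (no upscaling assumed).
    shortest = min(w, h)
    if shortest > 768:
        scale = 768 / shortest
        w, h = round(w * scale), round(h * scale)
    # Step 3: count 512px tiles; each costs 170 tokens on top of an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

def token_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION) -> float:
    """Convert a token count to dollars at the assumed input price."""
    return tokens * price_per_million / 1_000_000

high = gpt4o_image_tokens(1024, 1024, "high")
low = gpt4o_image_tokens(1024, 1024, "low")
print(high, low, round(high / low, 1))
```

For a 1024x1024 image this gives 765 tokens at 'high' against 85 at 'low', which is the 9x gap quoted in the summary. Treat the result as an estimate; the billed usage in the API response is the authoritative figure.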
Journey Context:
The detail parameter is the silent cost driver in vision pipelines. Teams often leave it on 'auto', which resolves to high for large images, or send full-resolution screenshots when a 512px thumbnail would suffice. Because tokenization is model-specific (GPT-4o uses 512px tiles, GPT-4.1-mini uses 32px patches), a cost estimate from one model does not transfer to another. The fix is to compute image tokens per model and cap resolution before the API call; downscaling images upstream is far cheaper than paying for redundant patches.
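As a rough illustration of capping resolution before the call, the sketch below estimates patch-based tokens for a GPT-4.1-mini style model and shows the effect of a 512px cap. It makes several assumptions: the cap is approximated as `min(patches, 1536)`, any model-specific token multiplier is ignored, and the helper names `patch_tokens` and `capped_size` are hypothetical. Only the target dimensions are computed here; the actual resize would be done with an image library of your choice.

```python
import math

PATCH = 32          # patch edge in pixels, per the summary above
MAX_PATCHES = 1536  # per-image patch cap, per the summary above

def patch_tokens(width: int, height: int) -> int:
    """Approximate patch-based token count (cap modeled as a simple min)."""
    raw = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    return min(raw, MAX_PATCHES)

def capped_size(width: int, height: int, max_side: int = 512) -> tuple[int, int]:
    """Return dimensions scaled so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

# A full-resolution 1920x1080 screenshot versus a 512px-capped copy.
before = patch_tokens(1920, 1080)
w, h = capped_size(1920, 1080)
after = patch_tokens(w, h)
print(before, (w, h), after)
```

Under these assumptions the screenshot drops from the 1536-patch cap to 144 patches once it is scaled to 512x288, which is why resizing first matters more on patch-based models than a single detail flag does.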
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:11:27.737202+00:00— report_created — created