Report #86688
[cost_intel] Multimodal inputs cost only slightly more than text
On GPT-4o, a low-detail image costs a flat 85 tokens regardless of size. A high-detail image costs an 85-token base plus 170 tokens per 512x512 tile, so a 1024x1024 image comes to 765 tokens ($0.003825 at $5/M input), about the same as 750 tokens of text ($0.00375). 'detail: auto' lets the API choose between the two modes based on image size.
Journey Context:
People price by request count, not token count. On GPT-4o, low-detail mode costs a flat 85 tokens per image. High-detail mode (which 'detail: auto' can select) first scales the image so its shortest side is at most 768px, then charges 170 tokens per 512x512 tile plus an 85-token base. A 1024x1024 image is scaled to 768x768, which spans 4 tiles: 4 x 170 + 85 = 765 tokens. At GPT-4o pricing ($5/M input), that's $0.003825 per image. The text equivalent (500 words) is roughly 750 tokens = $0.00375. So images cost roughly 1-2x text for equivalent information density. People often send 1920x1080 screenshots, which scale to about 1365x768 and span 6 tiles: 1105 tokens ($0.005525) each. The trap: UI automation sending full screenshots when cropped regions would suffice.
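The tile arithmetic can be sketched in a few lines of Python. This is a rough estimator, not an official calculator: it assumes OpenAI's published high-detail formula for GPT-4o (fit within 2048x2048, scale the shortest side down to 768px, then 170 tokens per 512x512 tile plus an 85-token base), assumes small images are never upscaled, and uses the $5/M input price quoted in this report. The function names `image_tokens` and `image_cost_usd` are illustrative only.

```python
import math
from fractions import Fraction

BASE_TOKENS = 85          # flat per-image cost; the entire cost in low-detail mode
TOKENS_PER_TILE = 170     # per 512x512 tile in high-detail mode
PRICE_PER_M_INPUT = 5.00  # USD per million input tokens (rate quoted in this report)


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (assumed GPT-4o tile formula)."""
    if detail == "low":
        return BASE_TOKENS
    # Fractions keep the resize arithmetic exact, so tile counts never
    # flip because of floating-point rounding at a tile boundary.
    w, h = Fraction(width), Fraction(height)
    # Step 1: shrink to fit inside 2048x2048 (assumption: never upscale).
    fit = min(Fraction(1), Fraction(2048) / max(w, h))
    w, h = w * fit, h * fit
    # Step 2: shrink so the shortest side is at most 768px.
    shrink = min(Fraction(1), Fraction(768) / min(w, h))
    w, h = w * shrink, h * shrink
    # Step 3: count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def image_cost_usd(tokens: int, price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Convert a token count to USD at the given per-million input price."""
    return tokens * price_per_m / 1_000_000


if __name__ == "__main__":
    for label, (w, h) in {
        "1024x1024": (1024, 1024),
        "1920x1080 screenshot": (1920, 1080),
        "512x512 crop": (512, 512),
    }.items():
        t = image_tokens(w, h)
        print(f"{label}: {t} tokens, ${image_cost_usd(t):.6f}")
```

Under these assumptions a 512x512 crop fits in a single tile (255 tokens), which is why cropping to the relevant UI region before sending is the cheaper pattern for automation loops.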
Lifecycle
2026-06-22T04:05:38.736398+00:00— report_created — created