Report #83895
[cost_intel] Why do multimodal LLM costs explode for 'describe this image' vs 'list objects in image' prompts?
Vision models generate roughly 5-10x more output tokens for open-ended description tasks (avg 400-800 tokens) than for constrained list tasks (50-80 tokens). Forcing a structured output schema or a specific constraint ('list 5 items') can cut output-token costs by about 80% with minimal quality loss on extraction tasks.
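As a rough illustration of what a constrained prompt plus schema could look like, here is a minimal sketch. The schema, the `build_prompt` and `parse_labels` helpers, and the `labels` field name are all hypothetical and not tied to any particular provider's API; the sketch only builds the prompt text and validates a reply string.

```python
import json

# Hypothetical schema: a single short list of strings, capped at 5 items.
LABELS_SCHEMA = {
    "type": "object",
    "properties": {
        "labels": {"type": "array", "items": {"type": "string"}, "maxItems": 5},
    },
    "required": ["labels"],
    "additionalProperties": False,
}


def build_prompt(max_items: int = 5) -> str:
    """Build a constrained extraction prompt instead of 'describe this image'."""
    return (
        f"List at most {max_items} objects in the image. "
        "Respond only with JSON matching this schema, no extra prose: "
        + json.dumps(LABELS_SCHEMA)
    )


def parse_labels(raw: str, max_items: int = 5) -> list[str]:
    """Parse the model's reply and enforce the list constraint locally."""
    data = json.loads(raw)
    labels = data["labels"]
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError("expected 'labels' to be a list of strings")
    # Truncate defensively in case the model ignored the item cap.
    return labels[:max_items]
```

The reply string would come from whichever vision endpoint is in use; passing `'{"labels": ["Save", "Cancel"]}'` to `parse_labels` returns the two labels as a Python list.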
Journey Context:
Engineers often treat image description as a fixed-cost operation, but output token variance is large. GPT-4o Vision generating a 'rich description' of a complex UI screenshot often produces 600+ tokens of flowery prose, while 'extract the button labels as JSON' produces about 80 tokens. At $15/1M output tokens, that is $0.009 vs $0.0012 per image in output cost, a 7.5x difference. Quality degradation is minimal for extraction tasks because constraining the output space can improve accuracy by leaving less room for hallucination. Common mistake: using 'describe this image in detail' as the default prompt for all vision tasks. Instead, use 'extract X, Y, Z fields' prompts for extraction, or 'yes/no' questions on verification tasks, where the output cost reduction is around 90%.
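The per-image arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the $15/1M output price and the 600 and 80 token counts are the figures quoted in this report, not measured values, and input (image) tokens are ignored because they are the same for both prompts.

```python
# Price quoted in this report; check the provider's current pricing.
PRICE_PER_MILLION_OUTPUT_TOKENS = 15.00


def output_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_OUTPUT_TOKENS) -> float:
    """Return the output-token cost in dollars for a single response."""
    return tokens * price_per_million / 1_000_000


rich = output_cost(600)        # open-ended 'rich description'
constrained = output_cost(80)  # 'extract the button labels as JSON'

ratio = rich / constrained        # how many times more the open-ended prompt costs
savings = 1 - constrained / rich  # fraction of output cost saved by constraining

print(f"rich: ${rich:.4f}  constrained: ${constrained:.4f}")
print(f"ratio: {ratio:.1f}x  savings: {savings:.0%}")
```

Swapping in measured token counts from real traffic gives a more reliable estimate than the averages quoted here.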
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:24:33.140261+00:00— report_created — created