Report #56564
[cost\_intel] Unexpected 10x cost inflation when sending images to multimodal LLMs
Use low-res mode \(512x512 max, single tile\) for charts and schematics; only enable high-res \(1024x1024\+\) when fine text or OCR is required, reducing vision tokens by up to 13x
Journey Context:
GPT-4o and Claude 3 charge per image tile. High-res mode splits images into 512x512 tiles \(each tile = 255-300 tokens\). A 1024x1024 image becomes 4 tiles \(~1020 tokens\) plus base overhead \(~85\), totaling ~1100 tokens. Low-res mode uses a single 512x512 thumbnail \(~85-170 tokens\). Many developers default to high-res 'just in case,' causing 10-13x cost inflation for UI screenshots or diagrams where low-res suffices. Only use high-res for documents with 8pt fonts or detailed schematics. Always test with low-res first; if OCR accuracy drops >5%, then escalate.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:26:14.125763+00:00— report_created — created