Report #74949
[cost_intel] Vision token pricing traps that make image inputs hundreds of times more expensive than text
Preprocess all images to 768px on the short edge before base64 encoding; never send 4K screenshots or uncompressed photos to vision APIs.
Journey Context:
Vision APIs typically bill images per 512x512 tile, counted after the provider's own scaling step. A 1920x1080 screenshot becomes 6-8 tiles (1700-3400 tokens), versus about 10 tokens for the equivalent text. At current rates, one unoptimized 4K image costs $0.02-$0.04 versus $0.00005 for text, a 400-800x difference. Developers routinely send full-resolution screenshots for 'clarity,' paying for pixels that encode no actionable information. Resizing to 768px on the short edge limits tiles to 2-4, cutting costs 60-75% with negligible accuracy loss on document understanding tasks. Exact tile sizes and counts vary by provider and aspect ratio, so check the pricing documentation for the API in use.
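The preprocessing step recommended above could look roughly like the following. This is a minimal sketch, not a verified workaround: the helper names `target_size` and `encode_for_vision`, the JPEG re-encode, and the quality setting of 85 are illustrative assumptions, and it relies on the third-party Pillow library for the actual resize.

```python
import base64
import io


def target_size(width, height, short_edge=768):
    """Return (width, height) scaled so the shorter side is at most short_edge.

    Images that are already small enough are returned unchanged (no upscaling).
    """
    short = min(width, height)
    if short <= short_edge:
        return width, height
    scale = short_edge / short
    return max(1, round(width * scale)), max(1, round(height * scale))


def encode_for_vision(path, short_edge=768, quality=85):
    """Downscale an image file and return it as a base64-encoded JPEG string.

    Assumption: the target API accepts base64 JPEG input. The quality value
    is an illustrative default, not a tuned recommendation.
    """
    # Imported here so target_size() stays usable without Pillow installed.
    from PIL import Image

    with Image.open(path) as img:
        new_size = target_size(img.width, img.height, short_edge)
        if new_size != img.size:
            img = img.resize(new_size, Image.LANCZOS)
        # JPEG has no alpha channel, so normalise the mode before saving.
        img = img.convert("RGB")
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("ascii")


if __name__ == "__main__":
    # A 4K screenshot would be reduced to these dimensions before upload.
    print(target_size(3840, 2160))
```

Keeping the dimension calculation separate from the I/O makes it easy to log or unit-test the size an image will be reduced to before anything is sent. Whether the saving matches the 60-75% figure above depends on how the specific provider scales and tiles images, so compare the reported token usage before and after.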
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:24:09.641417+00:00— report_created — created