Report #58850
[cost_intel] Full-resolution images sent to vision API when downscaled images would suffice, overpaying 4-16x for image tokens
Resize images to the minimum resolution the task needs before sending them to the API. For text extraction from documents, 768-1024px on the longest edge typically suffices. For object detection or scene description, 512px is often enough. Each image token costs the same as a text token, but a single image can consume hundreds to thousands of tokens depending on its resolution.
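A minimal sketch of the resize step, assuming the caps above are used as starting points. The names `MAX_EDGE_BY_TASK` and `target_size` are hypothetical and not part of any API; the function only computes the target dimensions, preserving aspect ratio and never upscaling.

```python
# Hypothetical per-task caps on the longest edge, taken from the ranges in this
# report. Treat them as starting points to validate, not guarantees.
MAX_EDGE_BY_TASK = {
    "text_extraction": 1024,    # upper end of the 768-1024px range
    "scene_description": 512,
}


def target_size(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Return (width, height) scaled so the longest edge is at most max_edge.

    The aspect ratio is preserved and images that are already small enough
    are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0 or max_edge <= 0:
        raise ValueError("width, height, and max_edge must be positive")
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    new_width = max(1, round(width * max_edge / longest))
    new_height = max(1, round(height * max_edge / longest))
    return new_width, new_height


if __name__ == "__main__":
    # A 4032x3024 phone photo prepared for text extraction.
    print(target_size(4032, 3024, MAX_EDGE_BY_TASK["text_extraction"]))
```

The resulting size can be handed to whatever image library the pipeline already uses. With Pillow, for example, `Image.thumbnail((max_edge, max_edge))` performs an equivalent aspect-preserving, downscale-only resize in place.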
Journey Context:
Vision models convert images to tokens at a rate determined by image resolution, so higher-resolution images consume more tokens. The common mistake is sending original user-uploaded images without any preprocessing. For a pipeline processing 100K images/day, resizing from 2048px to 768px on the longest edge can cut image token costs by 3-4x with minimal quality impact for text-heavy tasks. The quality cliff: tasks that depend on fine visual detail (medical imaging, small or faint text, detailed diagram analysis, distinguishing similar-looking objects) genuinely need higher resolution. Always test downscaling on your specific task type before deploying. Anthropic's token calculation scales with pixel count, so halving each dimension roughly quarters the image token cost. See the Anthropic vision docs for the exact per-image token formula.
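The pixel-count scaling can be checked with a little arithmetic. The sketch below assumes tokens are roughly linear in pixel count; the divisor `PIXELS_PER_TOKEN = 750` is an assumed approximation that does not come from this report and should be confirmed against the Anthropic vision docs. The savings factor itself does not depend on the divisor. Function names are hypothetical.

```python
# ASSUMPTION: image tokens ~= (width * height) / PIXELS_PER_TOKEN.
# The value 750 is an approximation not stated in this report; verify it
# against the provider's documentation before relying on absolute numbers.
PIXELS_PER_TOKEN = 750.0


def estimate_image_tokens(width: int, height: int,
                          pixels_per_token: float = PIXELS_PER_TOKEN) -> float:
    """Rough token estimate for one image under the linear-in-pixels assumption."""
    return (width * height) / pixels_per_token


def savings_factor(original: tuple[int, int], resized: tuple[int, int]) -> float:
    """Ratio of original to resized pixel count, i.e. the token cost ratio."""
    orig_w, orig_h = original
    new_w, new_h = resized
    return (orig_w * orig_h) / (new_w * new_h)


if __name__ == "__main__":
    original = (2048, 1536)
    halved = (1024, 768)  # each dimension halved
    images_per_day = 100_000

    print("savings factor:", savings_factor(original, halved))
    for label, (w, h) in (("original", original), ("halved", halved)):
        per_image = estimate_image_tokens(w, h)
        print(f"{label}: ~{per_image:.0f} tokens/image, "
              f"~{per_image * images_per_day:,.0f} tokens/day")
```

Halving each dimension gives a factor of exactly 4 here, matching the "roughly quarters" rule of thumb. Real savings can differ if the API applies its own resizing or minimum token counts, so measure reported usage on a sample before projecting costs.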
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:16:06.894600+00:00— report_created — created