Report #31146
[cost_intel] Why do high-resolution images cost up to 10x more on GPT-4V than on Claude 3.5 Sonnet?
Use Claude 3.5 Sonnet for high-resolution image analysis (>1024px). GPT-4V bills high-detail images per 512px tile at ~170 tokens per tile, while Claude downsizes any image to fit a budget of roughly 1600 tokens regardless of resolution, which by this report's estimate makes 4K images 5-10x cheaper on Claude.
Journey Context:
Engineers often assume vision pricing is comparable across providers and default to GPT-4V for every image task, but the two tokenization models differ substantially. OpenAI's vision model has two modes: low-detail mode bills a single downsized copy of the image, while high-detail mode divides the image into 512x512 tiles at ~170 tokens each. Claude 3.5 Sonnet instead resizes the image, preserving aspect ratio, to fit within a maximum token budget (roughly 1600 tokens for standard vision), so the per-image cost stops growing with resolution once that cap is reached.
Worked example: a 2048x4096 image tiled at native resolution gives 4 x 8 = 32 tiles, or 32 x 170 = 5,440 tokens (~$0.016 at the rates current when this report was written), against ~$0.003 for the same image on Claude, a gap of roughly 5x. Treat the GPT-4V figure as an upper bound: OpenAI's published high-detail calculation first scales the image to fit within 2048x2048 and then so that its shortest side is 768px, and adds a fixed 85 tokens, which lowers the billed tile count for large images. Recompute with each provider's current formula and rates before relying on the ratio.
Quality tradeoff: GPT-4V's tiling preserves fine detail such as small text better, while Claude's resizing may lose granularity. Decision rule: for documents with small fonts or intricate diagrams, GPT-4V's extra cost may be justified; for general image understanding, object detection, or screenshots, Claude is the cheaper option by this report's estimate.
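The token arithmetic above can be reproduced with a small estimator. This is a minimal sketch, not either provider's billing code: the 512px tile, ~170 tokens per tile, and ~1600-token cap come from this report, while the pre-scaling steps, the 85-token base, and the width x height / 750 estimate are assumptions based on my reading of the providers' published vision guides and should be checked against current documentation. All function names are hypothetical, and prices are passed in as parameters because this report does not state the per-token rates it used.

```python
import math

TILE_PX = 512           # tile edge, from this report
TOKENS_PER_TILE = 170   # approximate tokens per tile, from this report
BASE_TOKENS = 85        # assumption: fixed overhead per image in high-detail mode
CLAUDE_CAP = 1600       # approximate per-image cap, from this report


def tiles(width, height):
    """Number of 512px tiles needed to cover a width x height image."""
    return math.ceil(width / TILE_PX) * math.ceil(height / TILE_PX)


def gpt4v_tokens_native(width, height):
    """This report's model: tile at native resolution, no pre-resize."""
    return tiles(width, height) * TOKENS_PER_TILE


def gpt4v_tokens_prescaled(width, height):
    """Assumed published procedure: fit within 2048x2048, then shrink so the
    shortest side is 768px, tile, and add the fixed base tokens."""
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    scale = min(1.0, 768 / min(w, h))  # assumption: only scale down, never up
    w, h = round(w * scale), round(h * scale)
    return tiles(w, h) * TOKENS_PER_TILE + BASE_TOKENS


def claude_tokens(width, height):
    """Assumption: about width*height/750 tokens, capped at roughly 1600."""
    return min(math.ceil(width * height / 750), CLAUDE_CAP)


def cost_usd(tokens, usd_per_million_tokens):
    """Convert a token count to dollars at a caller-supplied rate."""
    return tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    w, h = 2048, 4096
    print("GPT-4V, native tiling:", gpt4v_tokens_native(w, h))
    print("GPT-4V, pre-scaled:   ", gpt4v_tokens_prescaled(w, h))
    print("Claude, capped:       ", claude_tokens(w, h))
```

For the 2048x4096 example, the native-tiling model gives the report's 5,440 tokens, while the pre-scaled variant gives 6 tiles plus the base, or 1,105 tokens. Which of the two applies determines whether the cost gap is large or small, so it is worth confirming against a real billed request before choosing a provider on cost alone.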
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:40:04.699120+00:00— report_created — created