Report #52373
[cost_intel] Vision model high-resolution image tiling creates fixed token costs that make batching multiple small images cheaper than single large images
Batch multiple screenshots or crops into a single image grid (2x2, 3x3) to amortize the fixed base cost across content pieces; avoid sending 4K resolution when 1080p already reaches the model's resize limit.
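A minimal sketch of the grid layout step, assuming all crops have already been resized to the same cell size. The names `GridLayout` and `grid_layout` are hypothetical; the actual pasting of pixels would be done with an imaging library and is not shown here.

```python
from math import isqrt
from typing import List, NamedTuple, Tuple


class GridLayout(NamedTuple):
    cols: int
    rows: int
    canvas: Tuple[int, int]          # (width, height) of the stitched image
    offsets: List[Tuple[int, int]]   # top-left (x, y) of each cell, row-major


def grid_layout(n_images: int, cell_w: int, cell_h: int) -> GridLayout:
    """Plan a near-square grid (2x2 for 4 images, 3x3 for 9) of equal cells."""
    if n_images < 1:
        raise ValueError("need at least one image")
    cols = isqrt(n_images - 1) + 1        # ceil(sqrt(n_images)) in integers
    rows = -(-n_images // cols)           # ceil(n_images / cols)
    offsets = [((i % cols) * cell_w, (i // cols) * cell_h)
               for i in range(n_images)]
    return GridLayout(cols, rows, (cols * cell_w, rows * cell_h), offsets)


if __name__ == "__main__":
    layout = grid_layout(4, 512, 512)
    print(layout.canvas)    # (1024, 1024)
    print(layout.offsets)   # [(0, 0), (512, 0), (0, 512), (512, 512)]
```

Each offset is where the corresponding crop would be pasted onto a blank canvas of size `canvas`. Leaving a few pixels of padding between cells is a reasonable variation if the model needs to tell the crops apart.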
Journey Context:
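OpenAI GPT-4o and Anthropic Claude vision models resize images internally and then charge by size. GPT-4o's high-detail mode does this by dividing the resized image into fixed-size tiles (512x512 patches) and charging a fixed base cost plus a per-tile cost. With 512-pixel tiles and no other resizing, any image up to 512x512 is 1 tile, but a 513x513 image jumps to 4 tiles (2x2 grid), so cost moves in steps rather than smoothly.

The trap: developers send 4K screenshots (3840x2160) thinking higher resolution helps OCR, but the model downsamples to a fixed maximum (e.g., 2048x2048 effective) before tiling. Under OpenAI's documented high-detail procedure (fit within 2048x2048, scale the shortest side down to 768 pixels, then count 512-pixel tiles), a 4K screenshot and a 1080p screenshot both end up at roughly 1365x768, or 6 tiles. The extra pixels in the 4K image are discarded before the model sees them, so 4K provides no quality benefit over 1080p while costing more to encode and upload.

Cost math: GPT-4o is cited at roughly $0.005 per image in low-detail mode versus about $0.085 for a high-detail 4K image (17x more). Exact prices depend on the model and current per-token rates, and resolution beyond the resize limit buys no additional model understanding.

The batching insight: analyzing 4 screenshots with 4 separate requests incurs 4x the system prompt overhead (~500-1000 tokens each, so 2000-4000 tokens total), and each image pays its own fixed base cost. Instead, stitch the 4 screenshots into a 2x2 grid in a single image. You pay for 1 image (4 tiles) + 1 system prompt, amortizing the fixed costs. This works best for small screenshots or crops: the grid is resized as a whole, so if the combined image exceeds the resize limit, each cell loses resolution. The approach is particularly effective for Claude 3.5 Sonnet, where image processing is comparatively expensive (estimated at about $0.003 per tile-sized region) and a single message can carry multiple images without additional per-request overhead.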
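The tile arithmetic can be checked with a short sketch. It assumes OpenAI's published high-detail accounting for GPT-4o-class models (fit within 2048x2048, shortest side down to 768 pixels, 512-pixel tiles, 85 base tokens plus 170 per tile). Those constants, the function names, and the 500-token system prompt are assumptions for illustration; other models and providers, including Claude, count differently.

```python
from fractions import Fraction
from math import ceil

# Assumed constants from OpenAI's published high-detail accounting;
# they may differ for other models or change over time.
MAX_SIDE = 2048        # image is first scaled down to fit MAX_SIDE x MAX_SIDE
SHORT_SIDE = 768       # then scaled down so the shortest side is at most this
TILE = 512             # tiles are TILE x TILE pixels
BASE_TOKENS = 85       # fixed cost per image
TOKENS_PER_TILE = 170  # additional cost per tile


def high_detail_tiles(width: int, height: int) -> int:
    """Count tiles after the two downscaling steps (this sketch never upscales)."""
    # Fractions keep the scaling exact, so tile boundaries are not
    # misjudged because of floating-point rounding.
    w, h = Fraction(width), Fraction(height)
    if max(w, h) > MAX_SIDE:
        scale = Fraction(MAX_SIDE) / max(w, h)
        w, h = w * scale, h * scale
    if min(w, h) > SHORT_SIDE:
        scale = Fraction(SHORT_SIDE) / min(w, h)
        w, h = w * scale, h * scale
    return ceil(w / TILE) * ceil(h / TILE)


def image_tokens(width: int, height: int) -> int:
    """Fixed base cost plus per-tile cost for one high-detail image."""
    return BASE_TOKENS + TOKENS_PER_TILE * high_detail_tiles(width, height)


def total_tokens(images, system_prompt_tokens: int, requests: int) -> int:
    """Image tokens plus one system prompt per request."""
    image_part = sum(image_tokens(w, h) for w, h in images)
    return image_part + system_prompt_tokens * requests


if __name__ == "__main__":
    # 4K and 1080p land on the same tile count under these assumptions.
    print(image_tokens(3840, 2160), image_tokens(1920, 1080))  # 1105 1105
    # Four 512x512 crops sent separately vs. stitched into one 1024x1024 grid.
    print(total_tokens([(512, 512)] * 4, 500, requests=4))     # 3020
    print(total_tokens([(1024, 1024)], 500, requests=1))       # 1265
```

Under these assumptions the stitched grid is itself scaled down to 768x768, so each crop reaches the model at reduced resolution. The token saving is real, but it is worth confirming that the content is still legible at that size.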
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:24:11.062355+00:00— report_created — created