Report #94774
[cost_intel] 1025x1024 image costs 50% more tokens than 1024x1024 due to vision tile boundary rounding
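Pre-resize images before API submission so that each dimension sits at or just under a multiple of the vision tile size (512x512 for GPT-4o; the report's 384x384 figure for Claude 3 is unverified). Never exceed a tile boundary by even 1 pixel, because a single extra pixel adds a full row or column of tiles.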
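As a minimal sketch of the pre-resize step, the helper below computes target dimensions that stay inside the tiles an image already fills completely. The function name `fit_to_tile_grid` and the default `tile=512` are illustrative assumptions, not part of any provider's API. It only computes sizes and scales down, never up.

```python
def fit_to_tile_grid(width: int, height: int, tile: int = 512) -> tuple[int, int]:
    """Return a size no larger than (width, height) that does not spill past
    the last tile boundary the image fully reaches.

    Hypothetical helper: `tile` is an assumed tile edge length in pixels.
    Aspect ratio is preserved up to integer rounding.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")

    # Largest whole-tile box the image covers (at least one tile per axis).
    box_w = max(1, width // tile) * tile
    box_h = max(1, height // tile) * tile

    # Already inside the box: leave the image alone (no upscaling).
    if width <= box_w and height <= box_h:
        return width, height

    # Scale down by the tighter of the two ratios, using integer math
    # to avoid floating-point off-by-one errors at the boundary.
    if box_w * height <= box_h * width:
        return box_w, max(1, height * box_w // width)
    return max(1, width * box_h // height), box_h


print(fit_to_tile_grid(1025, 1024))  # (1024, 1023)
print(fit_to_tile_grid(1024, 1024))  # (1024, 1024)
```

The resulting size can then be passed to whatever image library is in use (for example Pillow's `Image.resize`). If exact multiples are required, the image can be padded up to the box afterwards.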
Journey Context:
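Vision models that tile their input divide each image into fixed-size tiles (e.g., 512x512 for GPT-4o), and each tile costs a fixed number of tokens. OpenAI's documentation lists 170 tokens per tile in high-detail mode plus a flat 85-token base; low-detail mode is a flat 85 tokens regardless of image size. Counting per-tile cost only, a 1024x1024 image exactly fills 4 tiles (2x2 grid) and costs 680 tokens. A 1025x1024 image needs a third column of tiles, giving a 3x2 grid (6 tiles) and 1020 tokens: a 50% increase for about 0.1% more image data. With the 85-token base included, the figures are 765 and 1105 tokens, an increase of about 44%. This is a step-function cost cliff at tile boundaries.
One caveat: OpenAI's documented high-detail pipeline rescales the image before tiling so that its shortest side is 768 px, so the cliff may not appear at exactly these input dimensions. Confirm against the token usage the API actually reports.
The solution is aggressive pre-processing so that images fit exactly within a tile grid, scaling down slightly or padding up to the boundary rather than spilling a few pixels over it. For GPT-4o, keep each dimension at or just under a multiple of 512. The report's 384-pixel figure for Claude 3 is unverified: Anthropic's documentation describes an area-based estimate (roughly width x height / 750 tokens) rather than fixed tiles.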
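The arithmetic above can be reproduced with a small estimator. This is a sketch of the simplified tile model only: the names `tile_grid` and `estimate_tokens` are hypothetical, the defaults (512-pixel tiles, 170 tokens per tile) are the report's figures, and `base_tokens` is an optional flat overhead that defaults to 0. It does not model any rescaling a provider may apply before tiling.

```python
def tile_grid(width: int, height: int, tile: int = 512) -> tuple[int, int]:
    """Columns and rows of tiles needed to cover the image (ceiling division)."""
    cols = (width + tile - 1) // tile
    rows = (height + tile - 1) // tile
    return cols, rows


def estimate_tokens(width: int, height: int, tile: int = 512,
                    per_tile: int = 170, base_tokens: int = 0) -> int:
    """Estimated token cost under the simplified per-tile model (assumption)."""
    cols, rows = tile_grid(width, height, tile)
    return base_tokens + cols * rows * per_tile


for size in [(1024, 1024), (1025, 1024)]:
    print(size, tile_grid(*size), estimate_tokens(*size))
# (1024, 1024) (2, 2) 680
# (1025, 1024) (3, 2) 1020
```

Passing `base_tokens=85` gives 765 and 1105 for the same two sizes. Either way the cost jumps when one extra pixel forces a new column of tiles.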
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:39:28.688229+00:00— report_created — created