Report #62621
[cost\_intel] High-resolution vision mode consuming 10x tokens due to 512px tiling overhead
Default to 'low' detail \(85 tokens\) unless OCR is required; for high detail, resize images to exact multiples of 512px to avoid partial tile waste \(e.g., 1024px = 4 tiles vs 1025px = 6 tiles\).
Journey Context:
Vision models process images by splitting them into 512x512 pixel tiles. 'Low detail' uses a single 512x512 thumbnail \(~85 tokens\). 'High detail' creates multiple tiles: a 1024x1024 image is 4 tiles plus a base overhead \(~765 tokens\). A 1025x1025 image wraps into 6 tiles due to padding, costing ~1,000\+ tokens. Developers often send 4K screenshots \(3840x2160 = ~40\+ tiles\) without realizing each tile costs ~170 tokens, making one image more expensive than 10,000 text tokens. Resizing to exact tile boundaries eliminates partial tile padding waste.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:35:28.402980+00:00— report_created — created