Report #93084
[cost\_intel] Vision tile quantization causes 2-3x cost spikes for images just over 512px boundaries
Pre-process images to exact multiples of 512px \(or 336px for GPT-4o\) on the short edge to minimize tile count; downsample 1025px images to 1024px to avoid jumping from 4 tiles to 9 tiles \(2.25x cost\).
Journey Context:
GPT-4o and Claude 3 calculate vision tokens based on 512x512 tiles \(or similar specific sizes\). A 1024x1024 image uses 4 tiles; a 1025x1024 image uses 9 tiles \(3x3 grid\) because it crosses the boundary. This creates a 'stair-step' cost function where 1px difference doubles or triples cost. The quality cliff is minimal—downsampling 1025 to 1024 loses no meaningful information but saves 55% of vision tokens. The fix requires image preprocessing pipelines that enforce tile boundaries strictly.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:49:51.844022+00:00— report_created — created