Report #37763
[cost\_intel] High-resolution vision inputs cost 10-100x text tokens due to tile calculation
Pre-resize images to <=1024px on the long edge before base64 encoding; use the 'low' detail setting unless OCR is required; calculate tiles via ceil\(width/512\)\*ceil\(height/512\) and keep tiles <=4 to stay under 1k tokens, versus several thousand for 4K images by the same tile math
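A minimal sketch of the pre-call check described above, using only the tile formula from this report. The helper names (`tile_count`, `fit_long_edge`, `check_tile_budget`) are hypothetical, and the 1024px limit and 4-tile budget are this report's recommendations, not provider constants.

```python
from __future__ import annotations

import math

TILE = 512  # tile edge in pixels, as stated in this report


def tile_count(width: int, height: int) -> int:
    """Raw tile count: ceil(width/512) * ceil(height/512)."""
    return math.ceil(width / TILE) * math.ceil(height / TILE)


def fit_long_edge(width: int, height: int, max_edge: int = 1024) -> tuple[int, int]:
    """Return dimensions scaled down so the long edge is at most max_edge.

    Never upscales. Uses floor division so the result cannot exceed max_edge.
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    new_w = max(1, width * max_edge // long_edge)
    new_h = max(1, height * max_edge // long_edge)
    return new_w, new_h


def check_tile_budget(width: int, height: int, max_tiles: int = 4) -> None:
    """Raise before the API call if the image would exceed the tile budget."""
    tiles = tile_count(width, height)
    if tiles > max_tiles:
        raise ValueError(
            f"{width}x{height} needs {tiles} tiles, budget is {max_tiles}"
        )


if __name__ == "__main__":
    w, h = fit_long_edge(3840, 2160)   # a 4K screenshot
    check_tile_budget(w, h)            # passes: within the 4-tile budget
    print(w, h, tile_count(w, h))
```

The functions only compute target dimensions; the actual pixel resize would be done with an image library. With Pillow, for example, `Image.thumbnail((1024, 1024))` performs an aspect-preserving downscale in place.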
Journey Context:
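Vision models price images by size, not at a flat per-image rate. OpenAI's GPT-4o counts 512x512px tiles, while Claude 3 estimates roughly width\*height/750 tokens; either way, cost grows with resolution. By raw tile math, a 2048x4096 image is ceil\(2048/512\)\*ceil\(4096/512\) = 4\*8 = 32 tiles. In OpenAI's published scheme, high detail costs 170 tokens per tile plus a base of 85 tokens, and low detail is a flat 85 tokens regardless of size. So 32 tiles come to 32\*170 \+ 85 = 5,525 tokens, a single 512px tile costs 170 \+ 85 = 255 tokens \(about 20x less\), and a low-detail image costs 85 tokens \(about 65x less\). At an assumed input price of $2.50 per million tokens, that is about $0.014 versus $0.0006 per image. OpenAI also documents downscaling before tiling \(fit within 2048x2048, then shortest side to 768px\), so the billed count for a large image can come in below the raw figure; treat the raw tile math as an upper bound and check current provider docs. The trap is sending 4K screenshots from users directly to the API. The fix is server-side preprocessing: resize to 1024px on the long edge \(max 4 tiles, 765 tokens at high detail\), use detail: 'low' unless reading small text, and validate tile math before the call.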
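The token arithmetic can be sketched as follows, assuming OpenAI's published GPT-4o figures (85 base tokens, 170 tokens per 512px tile at high detail, a flat 85 tokens at low detail) and its documented downscale steps. The function names are hypothetical, rounding to whole pixels is an assumption, and the numbers may not match other models or future pricing.

```python
import math

BASE_TOKENS = 85       # flat charge per image; the entire cost at low detail
TOKENS_PER_TILE = 170  # per 512px tile at high detail
TILE = 512


def raw_tiles(width: int, height: int) -> int:
    """Tile count with no provider-side downscaling."""
    return math.ceil(width / TILE) * math.ceil(height / TILE)


def provider_downscale(width: int, height: int) -> tuple:
    """Apply the two documented steps: fit in 2048x2048, then short side to 768.

    Assumption: neither step upscales, and sizes round to whole pixels.
    """
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)
    scale = min(1.0, 768 / min(width, height))
    return round(width * scale), round(height * scale)


def image_tokens(width: int, height: int, detail: str = "high",
                 downscale: bool = True) -> int:
    """Estimate input tokens for one image."""
    if detail == "low":
        return BASE_TOKENS
    if downscale:
        width, height = provider_downscale(width, height)
    return BASE_TOKENS + TOKENS_PER_TILE * raw_tiles(width, height)


if __name__ == "__main__":
    print(image_tokens(2048, 4096, downscale=False))  # raw tile math, upper bound
    print(image_tokens(2048, 4096))                   # with documented downscale
    print(image_tokens(2048, 4096, detail="low"))     # flat low-detail cost
```

Comparing the raw and downscaled estimates for the same image shows how much of the saving the provider already applies; pre-resizing still cuts upload size and keeps the bill predictable across providers.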
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:51:53.010250+00:00— report_created — created