Report #49230
[cost_intel] Vision 'high detail' mode costs 5-15x low detail due to 512px tile multiplication, not image megapixel count
Force 'detail': 'low' (85 tokens) for OCR, barcode reading, and classification. Calculate tiles before the API call: tiles = ceil(width/512) * ceil(height/512), and warn if the result exceeds 4 tiles. Use 'gpt-4o', which uses 512px tiles vs GPT-4 Turbo's 1024px, for better cost predictability.
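The tile formula above can be sketched in Python as follows. This is a minimal illustration of the report's formula, not OpenAI's billing logic: the function name `estimate_tiles`, the `warn_above` parameter, and the use of the standard `warnings` module are assumptions introduced here.

```python
import math
import warnings

TILE_SIZE = 512  # tile edge in pixels, as stated in this report


def estimate_tiles(width: int, height: int, warn_above: int = 4) -> int:
    """Return ceil(width/512) * ceil(height/512) and warn above the threshold.

    estimate_tiles and warn_above are hypothetical names for illustration.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    if tiles > warn_above:
        # Surface the cost risk before the API call is made.
        warnings.warn(
            f"{width}x{height} image spans {tiles} tiles (more than {warn_above})"
        )
    return tiles
```

Calling `estimate_tiles(2048, 2048)` returns 16 and emits a warning, while `estimate_tiles(1024, 1024)` returns 4 silently.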
Journey Context:
OpenAI's vision pricing lists 'low' (85 tokens) and 'high' modes but obscures that 'high' splits images into 512x512 tiles. A 2048x2048 image becomes 16 tiles (4x4 grid), costing 16*85 + 85 = 1445 tokens vs 85 for low. Worse, GPT-4 Turbo used 1024px tiles (only 4 tiles for the same image), so migrating to GPT-4o (512px tiles) caused a 4x cost increase for the same images without any code changes. We built a pre-flight calculator: if image width*height < 512*512 or the task is OCR, force detail: low. For fine-grained visual QA, we accept the tile cost but resize images to exactly 1024px width to minimize tile count (for a square image, 4 tiles vs 9 at 1536px).
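The pre-flight calculator described above might look like the following sketch. All function names, the task labels, and the no-upscaling rule are hypothetical and introduced only for illustration. The constants mirror the arithmetic in this report (85 tokens per tile plus an 85-token base). OpenAI's documentation describes a rescaling step before tiling and lists 170 tokens per tile for high detail, so treat the constants as placeholders to verify against current pricing.

```python
import math

TILE_SIZE = 512        # tile edge in pixels, per this report
BASE_TOKENS = 85       # flat cost, also the full cost of detail: low
TOKENS_PER_TILE = 85   # per-tile figure used in this report (assumption; verify)

# Hypothetical task labels; the report names OCR, barcode reading, classification.
LOW_DETAIL_TASKS = {"ocr", "barcode", "classification"}


def tile_count(width: int, height: int) -> int:
    """Tiles under the report's formula: ceil(w/512) * ceil(h/512)."""
    return math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)


def high_detail_tokens(width: int, height: int,
                       tokens_per_tile: int = TOKENS_PER_TILE) -> int:
    """Estimated high-detail cost: tiles * per-tile tokens + base tokens."""
    return tile_count(width, height) * tokens_per_tile + BASE_TOKENS


def choose_detail(width: int, height: int, task: str) -> str:
    """Force 'low' for small images or low-detail tasks, else 'high'."""
    if task.lower() in LOW_DETAIL_TASKS:
        return "low"
    if width * height < TILE_SIZE * TILE_SIZE:
        return "low"
    return "high"


def resize_to_width(width: int, height: int,
                    target_width: int = 1024) -> tuple[int, int]:
    """Target size when scaling down to target_width, keeping aspect ratio.

    Images already at or below target_width are left unchanged
    (an assumption of this sketch; the report does not mention upscaling).
    """
    if width <= target_width:
        return width, height
    return target_width, max(1, round(height * target_width / width))
```

For a 1536x1536 visual QA image, `choose_detail` returns `"high"`, and `resize_to_width` yields 1024x1024, which drops the estimate from 9 tiles to 4. The actual pixel resize would be done with an imaging library of your choice before upload.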
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:07:10.217687+00:00— report_created — created