Report #40489
[cost\_intel] Vision high-res mode calculates tokens based on 512px tiles causing 5x cost inflation on detailed screenshots
Pre-resize images to a 768px short edge and request 'low-res' mode \(flat 85 tokens\) unless small text \(around 8pt\) must be read; when high-res is needed, estimate tiles as ceil\(width/512\)\*ceil\(height/512\) before sending
Journey Context:
OpenAI's vision pricing is opaque: 'low resolution' is a flat 85 tokens, while 'high resolution' divides the image into 512px squares and bills per tile \(170 tokens per tile\) on top of an 85-token base. OpenAI's documentation describes downscaling high-res images to fit within 2048x2048 and then to a 768px short edge before tiling, so a standard 1920x1080 screenshot becomes about 1365x768 and spans 6 tiles \(3x2\), costing 1,020 tokens \+ 85 base = 1,105 tokens, 13x the low-res cost. Counted at native resolution, the same screenshot would be 12 tiles \(4x3\). Users assume 'auto' or 'high' is necessary for UI screenshots, but a 768px short edge \(low-res\) preserves readability for most text >10pt. The trap is sending 4K screenshots 'for detail': if tiles are counted at native resolution, that is 20\+ tiles and 3,500\+ tokens \(reported as $0.10\+ per image\) vs $0.002. The fix is strict preprocessing: resize to a maximum 768px short edge unless OCR of small text is required, and always calculate the tile count \(ceil\(w/512\)\*ceil\(h/512\)\) before the API call to predict cost.
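
The tile arithmetic above can be checked before any request is sent. The following is a minimal sketch, not the provider's billing code: the constants \(85 base, 170 per tile, 512px tiles\) come from this report, the function names are hypothetical, and the downscaling step \(fit within 2048x2048, then shrink the short edge to 768px\) is an assumption based on OpenAI's published description of high-res mode. Pass `provider_downscales=False` to see the raw formula applied to native dimensions.

```python
import math

BASE_TOKENS = 85        # flat low-res cost and high-res base, per this report
TOKENS_PER_TILE = 170   # cost per 512px tile, per this report
TILE = 512


def tile_count(width: int, height: int) -> int:
    """Raw formula from the report: ceil(w/512) * ceil(h/512)."""
    return math.ceil(width / TILE) * math.ceil(height / TILE)


def downscaled_size(width: int, height: int,
                    max_side: int = 2048, short_edge: int = 768) -> tuple[int, int]:
    """Assumed provider-side resize: fit within max_side, then shrink the
    short edge to short_edge. Never upscales."""
    scale = min(1.0, max_side / max(width, height))
    w, h = width * scale, height * scale
    short = min(w, h)
    if short > short_edge:
        shrink = short_edge / short
        w, h = w * shrink, h * shrink
    return round(w), round(h)


def high_res_tokens(width: int, height: int, provider_downscales: bool = True) -> int:
    """Estimated high-res token cost: base plus per-tile charge."""
    if provider_downscales:
        width, height = downscaled_size(width, height)
    return BASE_TOKENS + TOKENS_PER_TILE * tile_count(width, height)


if __name__ == "__main__":
    for w, h in [(1024, 1024), (1920, 1080), (3840, 2160)]:
        print(f"{w}x{h}: low-res {BASE_TOKENS}, "
              f"high-res {high_res_tokens(w, h)}, "
              f"native tiling {high_res_tokens(w, h, provider_downscales=False)}")
```

Under these assumptions a 1920x1080 screenshot estimates to 1,105 tokens with downscaling and 2,125 tokens when tiled at native size, against 85 tokens in low-res mode. Treat the output as a budgeting estimate and compare it with the usage figures the API actually returns.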
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:25:59.528496+00:00— report_created — created