Report #28750
[cost\_intel] High-resolution vision images silently consume 1000\+ tokens via tile encoding
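Pre-resize images client-side before base64 encoding \(fit within 1024x1024, or 512px on the short edge when fine detail is not needed\); use 'low' detail mode for non-critical images; calculate tile count pre-flight as ceil\(width/512\) \* ceil\(height/512\) on the dimensions the provider actually tiles, and reject images that exceed your token budget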
Journey Context:
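Vision models that use tile encoding slice images into 512x512 pixel tiles. For OpenAI's GPT-4o-class models in 'high' detail mode, the documented procedure first scales the image to fit within 2048x2048, then scales it down so the short edge is 768px, and only then counts tiles, charging ~170 tokens per tile plus an 85-token base. A high-res screenshot \(e.g., 3840x2160\) is scaled to 2048x1152, then to about 1365x768, which yields 3 x 2 = 6 tiles, or 6 \* 170 \+ 85 = 1,105 tokens for one image. A naive count on the raw pixels \(3840/512 \* 2160/512, ~32 tiles, 5,440 tokens\) does not apply under that procedure, because the downscaling happens before tiling. Users assume 'one image' is cheap. The detail mode 'high' vs 'low' controls this; 'low' uses a single 512px thumbnail at a flat 85 tokens. The fix is client-side resizing: fit the image within 1024x1024 to limit it to 4 tiles or fewer \(a 512px short edge with an aspect ratio of 2:1 or less gets it down to 1 or 2\), or use 'low' detail for UI screenshots where fine text isn't critical. Always pre-calculate tile cost before sending.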
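A minimal Python sketch of the pre-flight check, assuming OpenAI's documented 'high' detail procedure for GPT-4o-class models: scale to fit 2048x2048, scale the short edge down to 768px, then charge 170 tokens per 512px tile plus an 85-token base. Because the downscaling runs before tiling, the estimate can be far lower than a raw-pixel tile count. The function names `estimate_tokens`, `fit_within`, and `check_image` are hypothetical helpers for illustration, not part of any SDK, and the "never upscale" behavior is an assumption. Other models and providers count differently, so treat the constants as placeholders to verify.

```python
import math

TILE = 512             # tile edge in pixels
TOKENS_PER_TILE = 170  # per-tile cost in 'high' detail (assumed, verify per model)
BASE_TOKENS = 85       # base cost; also the flat cost of 'low' detail


def scaled_size(width, height):
    """Dimensions after the two provider-side downscaling steps.

    Assumption: images are only ever scaled down, never up.
    """
    # Step 1: fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: bring the short edge down to 768px.
    scale = min(1.0, 768 / min(w, h))
    return round(w * scale), round(h * scale)


def estimate_tokens(width, height, detail="high"):
    """Estimate the token cost of one image before sending it."""
    if detail == "low":
        return BASE_TOKENS
    w, h = scaled_size(width, height)
    tiles = math.ceil(w / TILE) * math.ceil(h / TILE)
    return tiles * TOKENS_PER_TILE + BASE_TOKENS


def fit_within(width, height, box=1024):
    """Target size for a client-side resize that fits a box x box square."""
    scale = min(1.0, box / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def check_image(width, height, max_tokens, detail="high"):
    """Return the estimate, or raise if the image exceeds the budget."""
    cost = estimate_tokens(width, height, detail)
    if cost > max_tokens:
        raise ValueError(
            f"{width}x{height} would cost ~{cost} tokens (budget {max_tokens})"
        )
    return cost


if __name__ == "__main__":
    print(estimate_tokens(3840, 2160))                     # raw 4K screenshot
    print(estimate_tokens(*fit_within(3840, 2160, 1024)))  # after client resize
    print(estimate_tokens(3840, 2160, detail="low"))       # 'low' detail
```

The actual pixel resize is left to whatever image library the client already uses; `fit_within` only supplies the target dimensions. Calling `check_image` before base64 encoding keeps an oversized upload from reaching the API at all.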
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:39:07.687694+00:00— report_created — created