Report #82840
[cost_intel] High-resolution vision inputs can cost 10x or more than low-resolution inputs because of tiling
Pre-resize images so the shortest side is 512px before base64 encoding. Set 'detail': 'low' for classification or for reading large, clearly legible text (fixed 85 tokens vs. a variable 1000+ tokens). Reserve 'detail': 'high' for OCR on fine print or detailed visual analysis. Estimate cost before sending by computing the tile count, ceil(width/512)*ceil(height/512), and avoid sending 4K screenshots at full resolution.
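The resize and 'detail' recommendations above can be sketched as follows. This is a minimal illustration, not a provider's reference implementation. The helper names `shortest_side_target` and `image_part` are hypothetical, and the `image_url` payload shape is an assumption about the target API, so check your provider's request format before relying on it. The sketch only computes target dimensions; the actual pixel resize is left to whatever imaging library you already use.

```python
import base64
from typing import Tuple


def shortest_side_target(width: int, height: int, target: int = 512) -> Tuple[int, int]:
    """Return (width, height) scaled so the shortest side equals `target`.

    Images whose shortest side is already <= target are returned unchanged
    (no upscaling). Hypothetical helper for illustration.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    shortest = min(width, height)
    if shortest <= target:
        return width, height
    scale = target / shortest
    return max(1, round(width * scale)), max(1, round(height * scale))


def image_part(image_bytes: bytes, mime: str = "image/png", detail: str = "low") -> dict:
    """Build one image content part with an explicit 'detail' setting.

    ASSUMPTION: the payload shape (type/image_url/url/detail) is a guess at
    the target API's format; the report itself names no provider.
    """
    if detail not in ("low", "high", "auto"):
        raise ValueError("detail must be 'low', 'high', or 'auto'")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}", "detail": detail},
    }
```

Setting 'detail' explicitly, rather than leaving the default, keeps the cost decision visible in the calling code.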
Journey Context:
Vision models tile images into 512px squares for high-detail analysis. A 2048x2048 image becomes 16 tiles (4 x 4), each costing roughly the equivalent of 250 text tokens, for a total of about 4000 tokens. Low-detail mode uses a single 512px thumbnail (~85 tokens). Developers often send full-resolution screenshots on the assumption that the model will downsample them, but the API tiles them instead and bills every tile. The 'detail' parameter defaults to 'auto', which often selects high-detail mode for large images and silently inflates costs.
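The arithmetic above can be checked with a short estimator. This is a rough sketch that uses only the figures quoted in this report (512px tiles, ~250 tokens per tile, ~85 tokens in low-detail mode). The constant and function names are hypothetical, and real per-tile costs and any pre-scaling applied before tiling differ by provider and model, so treat the output as an estimate.

```python
import math

TILE_SIZE = 512           # tile edge in pixels, as described in this report
TOKENS_PER_TILE = 250     # approximate figure from this report, not a billing constant
LOW_DETAIL_TOKENS = 85    # approximate low-detail cost from this report


def tile_count(width: int, height: int, tile: int = TILE_SIZE) -> int:
    """Number of tiles: ceil(width / tile) * ceil(height / tile)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return math.ceil(width / tile) * math.ceil(height / tile)


def estimate_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image token cost under the report's assumptions."""
    if detail == "low":
        return LOW_DETAIL_TOKENS
    return tile_count(width, height) * TOKENS_PER_TILE


if __name__ == "__main__":
    for w, h in [(2048, 2048), (3840, 2160), (512, 512)]:
        print(f"{w}x{h}: {tile_count(w, h)} tiles, "
              f"high ~{estimate_tokens(w, h)} tokens, "
              f"low ~{estimate_tokens(w, h, 'low')} tokens")
```

Under these assumptions a 3840x2160 screenshot comes to 8 x 5 = 40 tiles, which is why resizing before upload matters more than any other single step.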
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:38:21.356087+00:00— report_created — created