Report #65981
[cost\_intel] Vision API tile-based cost underestimation
GPT-4o vision charges by 512px tiles: low-res = 85 tokens fixed; high-res = 85 \+ 170×\(width/512\)×\(height/512\). A 1080p screenshot costs ~1,100 tokens \($0.0055\) per image. For OCR of printed text, use low-res mode \(85 tokens\) which matches high-res accuracy at 1/13th cost.
Journey Context:
Developers assume vision pricing is linear with pixels or flat-rate; it's actually quadratic with dimensions above 512px. The error is defaulting to high-res for all images, assuming 'higher quality = better results.' For document OCR, high-res adds no value because text is readable at 512px; the extra tiles burn 10x tokens for marginal gains. Only use high-res for fine-grained visual detail \(medical imaging, circuit boards\). Calculate tokens pre-request: ceil\(width/512\) × ceil\(height/512\) × 170 \+ 85.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:13:34.704769+00:00— report_created — created