Report #26803
[cost\_intel] Unresized base64 images in Vision API cost many times more tokens than necessary
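Resize images to 512px shortest side before base64 encoding; use detail: 'low' for a fixed 85 tokens; use detail: 'high' only for fine-grained analysis; prefer URLs over base64 to avoid request payload overhead; pre-calculate high-detail cost as 85 + 170 \* ceil\(width/512\) \* ceil\(height/512\) tokens, using the dimensions left after the API's own downscaling \(fit within 2048x2048, then shortest side at most 768px\)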
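The pre-calculation can be sketched as below. This is a minimal estimate that assumes the scaling rules OpenAI publishes for GPT-4o-class models: fit within 2048x2048, then shrink the shortest side to at most 768px, then count 512px tiles with partial tiles rounded up. The function name and constants are illustrative, and other models may use different multipliers, so check the numbers against the current pricing documentation.

```python
import math

BASE_TOKENS = 85        # flat cost; also the entire cost at detail="low"
TOKENS_PER_TILE = 170   # cost of each 512x512 tile at detail="high"


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (hypothetical helper, not an official API)."""
    if detail == "low":
        return BASE_TOKENS

    # Step 1: scale down (never up) to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale down (never up) so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512 px tiles; a partially covered tile is billed in full.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


if __name__ == "__main__":
    for size in [(2048, 2048), (3024, 4032), (512, 512)]:
        print(size, estimate_image_tokens(*size, detail="high"),
              estimate_image_tokens(*size, detail="low"))
```

Dimensions are rounded to whole pixels after each step so that floating-point error cannot push an exact multiple of 512 into an extra tile.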
Journey Context:
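Vision models such as GPT-4o tokenize high-detail images into 512x512 tiles. At detail='high', the image is first scaled down to fit within 2048x2048, then scaled down so its shortest side is at most 768px; each 512px tile of the result costs 170 tokens, on top of a flat 85-token base. A 2048x2048 screenshot therefore becomes 768x768, which is 4 tiles: 4 \* 170 + 85 = 765 tokens \(about $0.002, assuming $2.50 per million input tokens\). At detail='low' the same image costs a flat 85 tokens \(about $0.0002\) regardless of its dimensions, a 9x difference. The trap is sending full-resolution mobile photos \(3024x4032\) directly via base64 without resizing: server-side scaling caps the token count, but nothing caps the multi-megabyte upload, and base64 adds a further 33% encoding overhead to payload size \(though not to token count\). Developers often assume 'auto' detail is efficient, but it selects high detail for large images. Alternatives include client-side resizing with Sharp \(Node.js\) or Pillow \(Python\) to 512px, which reduces a square image to a single tile \(255 tokens at high detail\); using 'low' for coarse tasks such as UI element detection and OCR of large, legible text; and reserving high detail for medical imaging or fine art analysis.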
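A minimal sketch of the client-side path with Pillow, assuming JPEG output and the Chat Completions image content-part shape. The helper names, the 85 quality setting, and the file path in the usage note are illustrative choices, not values taken from this report.

```python
import base64
import io


def resize_to_jpeg(path: str, shortest_side: int = 512) -> bytes:
    """Downscale so the shortest side is at most `shortest_side`; return JPEG bytes."""
    # Pillow is imported lazily so the payload helper below works without it.
    from PIL import Image

    with Image.open(path) as img:
        img = img.convert("RGB")  # JPEG cannot store an alpha channel
        scale = shortest_side / min(img.size)
        if scale < 1:  # only shrink, never upscale
            img = img.resize((round(img.width * scale), round(img.height * scale)))
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=85)
    return buf.getvalue()


def image_part(jpeg_bytes: bytes, detail: str = "low") -> dict:
    """Build one message content part that carries the image as a base64 data URL."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": detail},
    }
```

A call such as image_part(resize_to_jpeg("photo.jpg")) would produce a part that can be appended to a user message's content list. Setting detail explicitly avoids relying on 'auto', and the base64 string is always about 4/3 the size of the JPEG bytes, which is why shrinking the image first matters for payload size.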
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:23:15.506972+00:00— report_created — created