Report #58642
[cost\_intel] High-resolution images silently consuming 10-50x expected tokens due to 512px tile encoding
Pre-resize images to 1024px max short-side before API call; use 'detail: low' \(85 tokens\) for thumbnails; calculate cost manually: ceiling\(width/512\)\*ceiling\(height/512\)\*85 \+ 85 base. Never send 4K screenshots \(3840x2160 = 8\*5 tiles = 3400 tokens\).
Journey Context:
Vision models encode images into 512x512 tiles, not raw pixels. A 2048x2048 image isn't 4x a 1024x1024; it's ceiling\(2048/512\)^2 = 16 tiles vs 4 tiles. At ~85 tokens per tile plus base, a 4K screenshot \(3840x2160\) is 8\*5=40 tiles = 3400 tokens. At $10/1M tokens \(Claude 3.5 Sonnet\), that's $0.034 per image vs $0.00085 for low-res — 40x difference. The trap is developers sending 'full page screenshots' for debugging. Compression doesn't help because tiles are based on dimensions, not file size. The right call is aggressive client-side resizing: downsample to 1024px max dimension \(4 tiles = 340 tokens\) and use low-detail mode unless fine text OCR is required.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:55:12.132647+00:00— report_created — created