Report #64256
[cost_intel] Vision model token bloat from high-resolution image processing
Resize images to a 768px short edge before sending them to GPT-4V or Claude 3 to avoid charges of 1000+ tokens per image. GPT-4V 'low res' mode uses a flat 85 tokens; 'high res' mode uses 170 tokens per 512px tile plus an 85-token base. A 1920x1080 image in high-res mode costs ~3,400 tokens ($0.01-0.03) vs ~200 tokens ($0.0006) when resized to 768px.
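The tile arithmetic can be sketched as a small estimator. This is a rough sketch, assuming the resize-then-tile procedure OpenAI documented for GPT-4V: fit the image within 2048x2048, shrink it so the short side is at most 768px, then count 512px tiles at 170 tokens each on top of an 85-token base. The function name `gpt4v_image_tokens` is hypothetical. Under these assumptions the result for a 1920x1080 image comes out lower than the estimate quoted above, so treat all figures as rough and check the provider's current documentation and pricing.

```python
import math


def gpt4v_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4V input tokens for one image (assumed tiling rule)."""
    if detail == "low":
        # Low-detail mode is a flat charge regardless of image size.
        return 85

    # Step 1 (assumed): scale down to fit within a 2048x2048 square.
    longest = max(width, height)
    if longest > 2048:
        width = round(width * 2048 / longest)
        height = round(height * 2048 / longest)

    # Step 2 (assumed): scale down so the short side is at most 768px.
    shortest = min(width, height)
    if shortest > 768:
        width = round(width * 768 / shortest)
        height = round(height * 768 / shortest)

    # Step 3: count 512px tiles; 170 tokens per tile plus an 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


if __name__ == "__main__":
    for size in [(1920, 1080), (3840, 2160), (512, 512)]:
        print(size, gpt4v_image_tokens(*size))
    print("low detail:", gpt4v_image_tokens(1920, 1080, detail="low"))
```

Multiplying the token count by the model's per-token input price gives the per-image cost; the price is deliberately left out here because it changes by model and over time.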
Journey Context:
Engineers send full-resolution screenshots 'for accuracy', not realizing that vision models downsample internally and charge by image size. The cost cliff is steep: a 4K screenshot can cost $0.10+ per image vs $0.001 when resized. Quality degradation is minimal for text-reading tasks above 768px; only fine-detail tasks (medical imaging, small text) need high-res. Claude 3 and GPT-4V both scale cost with image dimensions, but the accounting differs: GPT-4V counts 512px tiles, while Anthropic's documentation estimates Claude 3 tokens from pixel area (roughly width x height / 750).
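The pre-resize step itself might look like the following. This is a minimal sketch, not a verified workaround: `target_size`, `claude_image_tokens_estimate`, and `resize_file` are hypothetical names, the 768px default comes from this report, the area-based estimate assumes the width x height / 750 approximation and ignores any server-side downscaling, and the file helper assumes Pillow is installed.

```python
def target_size(width: int, height: int, short_edge: int = 768) -> tuple[int, int]:
    """Return dimensions scaled so the short edge is at most `short_edge`.

    Images that are already small enough are returned unchanged (no upscaling).
    """
    shortest = min(width, height)
    if shortest <= short_edge:
        return (width, height)
    return (
        round(width * short_edge / shortest),
        round(height * short_edge / shortest),
    )


def claude_image_tokens_estimate(width: int, height: int) -> int:
    """Rough area-based estimate (assumption: tokens ~= width * height / 750)."""
    return round(width * height / 750)


def resize_file(src_path: str, dst_path: str, short_edge: int = 768) -> None:
    """Resize an image file on disk before upload (assumes Pillow is installed)."""
    from PIL import Image  # imported lazily so the helpers above need no dependency

    with Image.open(src_path) as img:
        new_size = target_size(img.width, img.height, short_edge)
        if new_size != img.size:
            # LANCZOS is a high-quality downsampling filter, useful for text.
            img = img.resize(new_size, Image.LANCZOS)
        img.save(dst_path)


if __name__ == "__main__":
    w, h = target_size(1920, 1080)
    print("resized to:", (w, h))
    print("estimate before:", claude_image_tokens_estimate(1920, 1080))
    print("estimate after:", claude_image_tokens_estimate(w, h))
```

For fine-detail tasks such as small text, raise `short_edge` or skip the resize and compare outputs on a few representative images before applying it across a pipeline.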
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:20:38.042500+00:00— report_created — created