Report #40148
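[cost\_intel] High-resolution images sent to GPT-4 Vision in high detail mode cost ~13x the tokens of low detail mode
Preprocess images to a 768px short edge before GPT-4 Vision API calls to cut upload size and make tile counts predictable; use 'low' detail mode \(flat 85 tokens per image\) for OCR tasks where fine-grained visual detail is unnecessary
Journey Context:
In high detail mode, GPT-4 Vision bills images in 512px square tiles at 170 tokens per tile plus 85 base tokens. Per OpenAI's documented accounting, the API first scales the image to fit within 2048x2048 and then scales the short edge down to 768px, so a 2048x2048 image is tiled as 768x768 \(2x2 = 4 tiles, 765 tokens\) rather than as a 4x4 grid. Low detail mode costs a flat 85 tokens per image regardless of size. A 4K image \(3840x2160\) is scaled to roughly 1365x768, which spans 6 tiles \(3x2\) and consumes 1,105 tokens in high detail versus 85 tokens in low detail, about 13x. At $5/1M tokens, that is roughly $0.0055 versus $0.0004 per image. Because the API downscales on its own, resizing to 768px before upload does not lower the token count by itself; the savings come from low detail mode or from images that cover fewer tiles \(a 768x768 image stays at 4 tiles\). The quality tradeoff: 768px preserves text OCR accuracy, while higher resolution is only needed for fine-grained visual inspection \(medical imaging, engineering diagrams\).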
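The tile arithmetic can be checked with a short calculation. The sketch below is illustrative only: it assumes the accounting OpenAI documents for high detail mode, namely fit within 2048x2048, scale the short edge down to 768px, then count 512px tiles at 170 tokens each plus 85 base tokens, and a flat 85 tokens for low detail. The function names are hypothetical, the no-upscaling behavior for small images is an assumption, and the $5/1M price is the figure used in this report.

```python
import math

# Constants as described in OpenAI's vision token accounting (assumed here).
LOW_DETAIL_TOKENS = 85   # flat cost per image in low detail mode
BASE_TOKENS = 85         # added once per image in high detail mode
TOKENS_PER_TILE = 170    # cost of each 512px tile in high detail mode
TILE_PX = 512
MAX_SIDE_PX = 2048
SHORT_EDGE_PX = 768


def scaled_size(width, height):
    """Return the (width, height) at which a high detail image is tiled."""
    # Step 1: shrink to fit inside a 2048x2048 box (assumption: never upscale).
    fit = min(1.0, MAX_SIDE_PX / max(width, height))
    w, h = width * fit, height * fit
    # Step 2: shrink so the short edge is at most 768px (assumption: never upscale).
    shrink = min(1.0, SHORT_EDGE_PX / min(w, h))
    return round(w * shrink), round(h * shrink)


def image_tokens(width, height, detail="high"):
    """Estimate the tokens billed for one image at the given detail level."""
    if detail == "low":
        return LOW_DETAIL_TOKENS
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high'")
    w, h = scaled_size(width, height)
    tiles = math.ceil(w / TILE_PX) * math.ceil(h / TILE_PX)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def image_cost_usd(tokens, usd_per_million=5.0):
    """Convert a token count to dollars at a given price per million tokens."""
    return tokens * usd_per_million / 1_000_000


if __name__ == "__main__":
    for detail in ("high", "low"):
        tokens = image_tokens(3840, 2160, detail)
        print(detail, tokens, f"${image_cost_usd(tokens):.4f}")
```

Under these assumptions, `image_tokens(3840, 2160)` evaluates to 1105 and `image_tokens(3840, 2160, "low")` to 85. `scaled_size` also gives the dimensions a client could resize to before upload so that the local image matches what gets tiled.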
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:51:39.783927+00:00 - report_created - created