Report #94593
[cost\_intel] GPT-4o Vision image token cost explosion on high-resolution mode
Force low-resolution mode \(detail: 'low'\) for GPT-4o Vision when the image contains only text or simple diagrams <512px in any dimension. High-resolution mode \(detail: 'high'\) costs 10-50x more due to tile-based pricing \(170 tokens per 512px tile \+ 85 base tokens\). For a 2048x2048 image, high-res consumes 765 tokens \($0.0038\) vs low-res 85 tokens \($0.0004\). Only use high-res for medical imaging, detailed OCR of dense tables, or fine-grained visual inspection.
Journey Context:
Developers default to high-resolution assuming 'more pixels = better understanding,' bankrupting vision pipelines. The GPT-4o vision pricing is non-linear: low-res is fixed 85 tokens regardless of image size. High-res divides the image into 512px tiles, charging 170 tokens per tile. A 1024x1024 image is 4 tiles = 765 tokens \(9x cost\). A 2048x2048 is 16 tiles = 2805 tokens \(33x cost\). For text extraction, low-res is often superior because the model doesn't get lost in irrelevant visual noise. The 10x cost difference is material: processing 10k images/day costs $38 vs $3.80. High-res should be reserved for tasks requiring sub-500px detail recognition.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:21:25.099370+00:00— report_created — created