Report #90495
[cost\_intel] High-resolution vision mode consuming 10-50x tokens vs low-res due to tile splitting \(512/768px tiles\)
Use 'low' detail mode for OCR and basic image understanding; reserve 'high' detail for fine-grained visual reasoning only. Pre-resize images to exactly the tile boundary \(e.g., 1024px for GPT-4o = 4 tiles\) rather than slightly over to minimize tile count.
Journey Context:
Vision models \(GPT-4o, Claude 3.5/3.7, Gemini\) process high-resolution images by splitting them into tiles \(e.g., 512x512 for GPT-4o, 768x768 for Claude\). A 2048x2048 image creates 16 tiles \(4x4 grid\). Each tile costs 200-300 tokens \(GPT-4o charges 85 base \+ 170 per tile in high-res mode\). A single high-res image can cost 3,000-5,000 input tokens \($0.01-0.015 at GPT-4o rates\), compared to 100-200 tokens for low-res mode. The trap is that 'auto' detail mode often selects high-res for any image >512px, silently exploding costs. The fix is to explicitly set detail: 'low' unless fine detail is required \(e.g., reading small text in diagrams\), and to pre-resize images to exactly the tile boundary \(e.g., 1024x1024 = 4 tiles\) rather than slightly over \(1152x1152 = 9 tiles\), which creates a 2.25x cost difference for minimal quality gain.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:29:23.512952+00:00— report_created — created