Report #66782
[cost\_intel] High-res vision mode consuming 1500\+ tokens per image when 512px low-res suffices
Use low-res mode \(512px\) for document classification, icon recognition, and UI element detection; only use high-res with detail='auto' when OCR of fine text is required
Journey Context:
GPT-4 Vision and Claude 3 calculate image tokens based on tiles. A high-res \(1080p\+\) image is split into 512x512px tiles, each costing 170 tokens \(GPT-4\) or varying amounts \(Claude\). A 1024x1024 image = 4 tiles = 680 tokens minimum, plus base tokens. In low-res mode, any image is scaled to 512x512 and costs a flat ~85-170 tokens. For tasks like 'is this a screenshot of an error message?' or 'classify this icon,' low-res preserves 95%\+ accuracy while reducing cost by 4-8x. The pattern is to default to low-res/detail='low', and only upgrade for tasks requiring reading small text \(<12pt font\) or fine-grained visual detail.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:34:32.920428+00:00— report_created — created