Report #59184
[cost\_intel] Image input costs 10x higher than text due to detail mode auto-selection
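Force 'detail: low' for images larger than 512px unless OCR or other fine detail is required. In high detail mode, OpenAI's documented calculation first scales the image to fit within 2048x2048, then scales the shortest side down to 768px, and bills 170 tokens per 512x512 tile plus a flat 85 tokens. A 2048x4096 screenshot is therefore resized to 768x1536, which is 6 tiles, or 1105 tokens \(~$0.033 at the roughly $0.03 per 1K input tokens this report's figures imply\), vs 85 tokens \(~$0.0025\) in low mode: about a 13x difference.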
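A minimal sketch of the rule above, assuming the OpenAI Chat Completions image content format \(`type: image_url` with a `detail` field\). The helper names `choose_detail` and `image_part`, the `needs_ocr` flag, and the example URL are hypothetical and only illustrate the idea; the 512px threshold is taken from this report.

```python
from typing import Any


def choose_detail(width: int, height: int, needs_ocr: bool = False) -> str:
    """Pick a detail mode following the rule in this report (hypothetical helper)."""
    if needs_ocr:
        # Reading small text needs the full-resolution tiles.
        return "high"
    if max(width, height) > 512:
        # Large image, coarse question: pin the flat low-detail cost.
        return "low"
    # Small images are left to the API's own selection.
    return "auto"


def image_part(url: str, width: int, height: int, needs_ocr: bool = False) -> dict[str, Any]:
    """Build one image entry for a chat message's content list."""
    return {
        "type": "image_url",
        "image_url": {"url": url, "detail": choose_detail(width, height, needs_ocr)},
    }


part = image_part("https://example.com/screenshot.png", 1920, 1080)
print(part["image_url"]["detail"])  # low
```

The resulting dict would go into the `content` list of a user message alongside a text part; the image dimensions have to be known client-side, for example from the file header, before the request is built.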
Journey Context:
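GPT-4 Vision prices image input by tile count. In 'high' detail mode, which the API's default 'auto' setting selects for large images, the image is scaled to fit within 2048x2048, its shortest side is scaled down to 768px, and the result is split into 512x512 tiles at 170 tokens each, plus a flat 85 tokens. A standard 1920x1080 screenshot is resized to about 1365x768, which is 6 tiles, or 1105 tokens. Claude 3 has no detail parameter; Anthropic's documentation estimates image tokens from pixel area \(roughly width x height / 750\), so the equivalent saving there comes from downscaling the image before upload. Teams sending UI screenshots at native resolution without specifying detail: low spend roughly 9-17x more tokens than necessary \(765 to 1445 tokens vs 85\) on tasks like 'is there a button here?' where low resolution suffices. Because the default picks high detail for large images without any warning, this is a silent cost trap.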
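The tile arithmetic can be checked with a short estimator. This is a sketch of the calculation as described in OpenAI's vision guide, including the two resize steps and the 85-token base cost; the function name `estimate_image_tokens` is hypothetical, the downscale-only behaviour and pixel rounding are assumptions, and actual billing may differ by model.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate OpenAI image input tokens (assumed formula, not an official API)."""
    if detail == "low":
        # Low detail is a flat cost regardless of image size.
        return 85

    # Step 1: fit within 2048x2048, preserving aspect ratio (assumed downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale so the shortest side is 768px (assumed downscale only).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512x512 tiles; 170 tokens per tile plus 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1920, 1080))          # 1105
print(estimate_image_tokens(2048, 4096))          # 1105
print(estimate_image_tokens(1920, 1080, "low"))   # 85
```

Running an estimate like this over a sample of real uploads before and after pinning the detail mode gives a quick projection of the savings.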
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:49:37.938004+00:00— report_created — created