Report #52811
[cost_intel] GPT-4 Vision high-res tile calculation creates 10-25x cost variance between low and high detail modes based on image aspect ratio
Default to "low" detail mode (85 tokens flat) for all screenshots and text-heavy images under 1024x1024; calculate the exact tile count before sending: tiles = ceil(width/512) * ceil(height/512); crop images to exact 512px multiples to avoid paying for partially filled tiles; set the detail level explicitly instead of leaving it on "auto", which defaults to high for images >512px
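The recommendations above can be sketched as a few small helpers. This is a minimal illustration, not a definitive implementation: `choose_detail`, `crop_size`, and `image_part` are hypothetical names, the strict "both sides under 1024" reading of the size rule is an assumption, and the dictionary follows the `image_url` content-part shape used by the OpenAI Chat Completions API. No request is sent.

```python
from typing import Literal

Detail = Literal["low", "high"]
TILE = 512  # tile edge in pixels, as stated in this report


def choose_detail(width: int, height: int, text_heavy: bool) -> Detail:
    """Pick "low" for text-heavy images with both sides under 1024px (assumed reading)."""
    if text_heavy and width < 1024 and height < 1024:
        return "low"
    return "high"


def crop_size(width: int, height: int) -> tuple[int, int]:
    """Round each side down to a multiple of 512; sides below 512 are left unchanged."""
    def snap(side: int) -> int:
        return side if side < TILE else side // TILE * TILE
    return snap(width), snap(height)


def image_part(url: str, detail: Detail) -> dict:
    """Build an image content part with an explicit detail level instead of "auto"."""
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


# Hypothetical usage: an 800x600 text-heavy screenshot is sent in low detail.
part = image_part("https://example.com/screenshot.png", choose_detail(800, 600, True))
```

Cropping discards pixels at the edges, so whether `crop_size` is acceptable depends on where the relevant content sits in the image.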
Journey Context:
OpenAI's vision pricing has two tiers. "low" detail costs a flat 85 tokens regardless of resolution. "high" detail costs 85 base tokens plus 170 tokens per 512x512 tile. Applying the tile formula to a 1920x1080 screenshot gives ceil(1920/512) = 4 tiles wide by ceil(1080/512) = 3 tiles high, or 12 tiles, for a total of 85 + (12 * 170) = 2,125 tokens. The same image in low detail costs 85 tokens, a 25x price difference. (OpenAI's published calculation also rescales large images before tiling, so the billed tile count can be lower than this raw-dimension estimate; check the usage figures returned by the API.) The trap: developers choose "high" detail by default on the assumption that users want maximum quality, or leave the setting on "auto", which selects high for any image larger than 512px. For text-heavy screenshots (the most common use case), low detail captures text adequately (the image is downsampled but still OCR-capable) at 1/25th the cost. Exact dimensions matter enormously: a 513x513 image triggers 4 tiles (85 + 4 * 170 = 765 tokens) while a 512x512 image uses 1 tile (255 tokens), a 3x difference for a 1px change in each dimension. Production systems should crop images to exact 512px multiples or accept low detail for all non-photo content.
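The arithmetic above can be checked with a short estimator. This is a sketch of the formula as stated in this report (85 base tokens plus 170 per 512px tile, computed on the raw dimensions); the function names are hypothetical, and it deliberately ignores any resizing the service applies before tiling, so treat the result as an estimate rather than a billing figure.

```python
BASE_TOKENS = 85       # flat cost, also the full cost of "low" detail
TOKENS_PER_TILE = 170  # added per 512x512 tile in "high" detail
TILE = 512


def tile_count(width: int, height: int) -> int:
    """Tiles needed to cover the image: ceil(width/512) * ceil(height/512)."""
    tiles_wide = (width + TILE - 1) // TILE   # integer ceiling division
    tiles_high = (height + TILE - 1) // TILE
    return tiles_wide * tiles_high


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimated token cost for one image under the report's formula."""
    if detail == "low":
        return BASE_TOKENS
    return BASE_TOKENS + TOKENS_PER_TILE * tile_count(width, height)


for w, h in [(1920, 1080), (512, 512), (513, 513)]:
    high = image_tokens(w, h)
    low = image_tokens(w, h, "low")
    print(f"{w}x{h}: {tile_count(w, h)} tiles, {high} tokens high, {high / low:.1f}x low")
```

For the three sizes discussed above this yields 12 tiles and 2,125 tokens, 1 tile and 255 tokens, and 4 tiles and 765 tokens respectively.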
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:08:27.800283+00:00— report_created — created