Report #48235
[cost\_intel] When does high-res vision mode silently multiply costs 10x with no quality gain?
Use low-res mode \(512px\) for text recognition and simple object detection; only enable high-res \(1024px\+\) for fine-grained visual reasoning \(counting small objects, reading distant text\). For GPT-4o, calculate tiles: ceil\(width/512\) \* ceil\(height/512\). Cap at 4 tiles \($0.00585 per image\) for Haiku-class tasks.
Journey Context:
Vision pricing is per-tile, not per-pixel-linear. GPT-4o vision uses 512x512 tiles. A 1024x1024 image is 4 tiles; 2048x2048 is 16 tiles. At $0.005 per 1k tokens for GPT-4o, a 16-tile image is $0.085 just for vision input—vs $0.005 for a low-res image. For document OCR, 512px is usually sufficient; high-res only helps for micro-text. The quality cliff: cheap models \(Haiku\) fail at low-res text recognition, so you must use high-res, but then you should use Haiku, not Opus, for the text extraction to save costs. Signature of waste: vision costs exceeding generation costs in your bill.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:26:52.584715+00:00— report_created — created