Report #82584
[cost\_intel] Vision API cost traps: GPT-4o tile pricing vs Gemini flat-rate for document processing
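GPT-4o prices high-detail images by 512x512 tile: 170 tokens per tile plus a flat 85-token base, counted after the image is downscaled to fit within 2048x2048 and its shortest side is reduced to 768 px. A 2048x2048 image is therefore processed as 768x768, which is 4 tiles and 765 tokens \(roughly $0.002 at a list price of $2.50 per million input tokens\). Gemini 1.5 Flash charges a flat $0.00002 per image up to 1024x1024. For a 100-page PDF, Gemini is roughly 50-100x cheaper; for a single high-resolution medical image that requires fine detail, GPT-4o's tile scaling preserves better detail per dollar.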
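The tile arithmetic can be sketched in a few lines of Python. This is a minimal estimate that assumes OpenAI's published high-detail rule: fit within 2048x2048, reduce the shortest side to 768 px, then charge 170 tokens per 512x512 tile plus an 85-token base. The function name `gpt4o_image_tokens` is hypothetical, and two details are assumptions rather than documented behavior: small images are not upscaled, and scaled dimensions are rounded to whole pixels.

```python
import math

BASE_TOKENS = 85        # flat charge per image (also the full cost in low-detail mode)
TOKENS_PER_TILE = 170   # charge per 512x512 tile in high-detail mode


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (hypothetical helper, not an SDK call)."""
    if detail == "low":
        return BASE_TOKENS

    # Step 1: downscale (never upscale) so the image fits inside 2048x2048.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: downscale (never upscale, by assumption) so the shortest side is 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


if __name__ == "__main__":
    for size in [(512, 512), (1024, 1024), (2048, 2048), (2048, 4096)]:
        print(size, gpt4o_image_tokens(*size))
```

Under these assumptions the cost stops growing once an image exceeds the resize limits, so a very large scan costs no more than a 2048-pixel one.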
Journey Context:
Developers often assume 'multimodal' pricing is comparable across providers, but the economics diverge sharply with the pricing model. GPT-4o's tile-based vision pricing penalizes resolution: doubling an image's width and height roughly quadruples the tile count, and with it the token cost, until the image reaches the size at which GPT-4o downscales it. Gemini's flat per-image rate makes it economically viable to process thousands of document pages. However, for tasks requiring sub-tile detail \(reading small serial numbers on chips\), GPT-4o's native resolution handling wins. Critical decision: document pipelines → Gemini; computer vision detail tasks → GPT-4o with resolution scaling.
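The decision rule above can be made concrete with a small batch-cost comparison. This is only a sketch: all function names are hypothetical, the flat rate of $0.00002 per image is the figure quoted in this report, and the 765 tokens per page and $2.50 per million input tokens used in the example are assumptions to replace with your own measurements and current list prices.

```python
def token_priced_cost(pages: int, tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Batch cost in USD when each image is billed by input tokens (GPT-4o style)."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


def flat_priced_cost(pages: int, usd_per_image: float) -> float:
    """Batch cost in USD when each image is billed at a flat rate (Gemini 1.5 Flash style)."""
    return pages * usd_per_image


def pick_provider(needs_sub_tile_detail: bool) -> str:
    """Routing rule from this report: detail-critical work to GPT-4o, bulk documents to Gemini."""
    return "gpt-4o" if needs_sub_tile_detail else "gemini-1.5-flash"


if __name__ == "__main__":
    pages = 100
    # Assumed inputs: 765 tokens per page image, $2.50 per million input tokens.
    tiled = token_priced_cost(pages, tokens_per_page=765, usd_per_million_tokens=2.50)
    flat = flat_priced_cost(pages, usd_per_image=0.00002)
    print(f"token-priced: ${tiled:.5f}  flat: ${flat:.5f}  ratio: {tiled / flat:.1f}x")
    print("bulk PDF pages ->", pick_provider(needs_sub_tile_detail=False))
    print("chip serial numbers ->", pick_provider(needs_sub_tile_detail=True))
```

With those assumed inputs the 100-page batch comes out at roughly 96x more expensive under token pricing, which sits inside the 50-100x range cited in this report. Keeping the prices as parameters matters because both providers revise them often.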
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:12:31.387002+00:00— report_created — created