Report #91294
[cost\_intel] Using GPT-4o vision for all image tasks assuming 'one model' simplicity
For OCR and text extraction from images, use GPT-4o mini with image input: it is quoted at roughly 16x lower cost \(the low-res rates given here, $0.0075 vs $0.150 per 1M tokens, work out to 20x\) with a <2% accuracy drop on printed text. Reserve GPT-4o vision for spatial reasoning and chart interpretation.
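A minimal sketch of the routing rule described above, assuming each task in the pipeline is already labeled by type. The `ImageTask` enum and `pick_vision_model` function are hypothetical names introduced for illustration; the strings `gpt-4o-mini` and `gpt-4o` are assumed to be the model identifiers passed to the API.

```python
from enum import Enum


class ImageTask(Enum):
    """Hypothetical task labels taken from the categories named in this report."""
    OCR = "ocr"
    TEXT_EXTRACTION = "text_extraction"
    SPATIAL_REASONING = "spatial_reasoning"
    CHART_INTERPRETATION = "chart_interpretation"


# Assumed model identifiers; verify against the provider's current model list.
CHEAP_MODEL = "gpt-4o-mini"
FULL_MODEL = "gpt-4o"

# Tasks that only require reading text go to the cheaper model.
_READING_TASKS = {ImageTask.OCR, ImageTask.TEXT_EXTRACTION}


def pick_vision_model(task: ImageTask) -> str:
    """Return the model name to use for a given image task."""
    return CHEAP_MODEL if task in _READING_TASKS else FULL_MODEL


if __name__ == "__main__":
    for task in ImageTask:
        print(f"{task.value}: {pick_vision_model(task)}")
```

Defaulting unknown or non-reading tasks to the full model keeps the failure mode conservative: a misrouted task costs more rather than silently losing accuracy.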
Journey Context:
Vision tokens cost the same as text tokens, but images are tokenized aggressively \(e.g., a screenshot is ~1500 tokens\), so image-heavy pipelines accumulate tokens quickly. The reported cost difference of about 16x means a 1000-image pipeline costs $225 with GPT-4o vs $14 with GPT-4o mini; these totals are far higher than the per-token rates above would give for ~1500-token images, so check them against current pricing before budgeting. Quality degradation signature: GPT-4o mini fails on handwritten text, complex layouts \(tables, infographics\), small fonts \(<10pt\), and rotated text. Proven pattern: use GPT-4o mini for 'reading' text and GPT-4o for 'understanding' layouts and relationships.
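The cost arithmetic can be checked with a short sketch. It uses only the figures quoted in this report (the low-res per-1M-token rates and ~1500 tokens per screenshot); the function name and price table are hypothetical, and real prices and image token counts should be taken from the provider's current pricing page. With these inputs the result is roughly $0.225 for GPT-4o and $0.011 for GPT-4o mini per 1000 images, a 20x ratio, which does not reproduce the $225 vs $14 totals, so those presumably rest on inputs not stated here.

```python
# Per-1M-token rates as quoted in this report (assumption: input tokens only).
PRICE_PER_1M_TOKENS = {
    "gpt-4o": 0.150,
    "gpt-4o-mini": 0.0075,
}

# Approximate tokens per screenshot, as quoted in this report.
TOKENS_PER_IMAGE = 1500


def pipeline_cost(model: str, n_images: int,
                  tokens_per_image: int = TOKENS_PER_IMAGE) -> float:
    """Estimate the image-input cost in dollars for n_images on a given model."""
    total_tokens = n_images * tokens_per_image
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]


if __name__ == "__main__":
    full = pipeline_cost("gpt-4o", 1000)
    mini = pipeline_cost("gpt-4o-mini", 1000)
    print(f"gpt-4o:      ${full:.4f}")
    print(f"gpt-4o-mini: ${mini:.4f}")
    print(f"ratio:       {full / mini:.1f}x")
```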
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:49:52.527394+00:00— report_created — created