Report #50023
[cost\_intel] When does GPT-4o-mini fail on vision tasks vs GPT-4o despite 96% cost reduction?
Mini fails on text smaller than 12pt, rotated images, and multi-panel layouts \(comic strips, academic papers\); use GPT-4o for document OCR, mini for natural photos with large dominant objects.
Journey Context:
Teams often route all vision work to mini to cut image costs by roughly 30x, then hit hard failures on specific visual patterns. OpenAI's model card notes that mini struggles with fine-grained text and spatial relationships. Real example: mini read a restaurant menu photo with large text correctly \(95% accuracy\) but failed on a screenshot of terminal output in a 10pt monospace font \(40% accuracy\).

The per-image price is $0.005 for mini vs $0.15 for 4o, but retries and manual correction eat into that gap. For a batch of 10k document pages, mini might fail on 30% and need human review \($500 labor cost\), while 4o fails on 2% \($50 labor\). At those figures the API bill is $50 for mini and $1,500 for 4o, so the totals come to about $550 vs $1,550 and mini is still cheaper on paper. 4o only becomes cheaper all-in once reviewing a failed page costs more than roughly $0.52, which is where the 2,800 extra mini failures outweigh the $1,450 API saving.

The decision tree: if the image contains text <12pt, tables, or requires reading order across multiple columns → use 4o. If it's cat photos, car damage, or general object detection → mini is sufficient.
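The cost reasoning and the routing rule can be sketched in a few lines of Python. This is a minimal illustration, not a tested pipeline: the per-image prices and failure rates are the figures quoted in this report, while `ModelProfile`, `total_cost`, `break_even_review_cost`, `choose_model`, and the image attributes passed to `choose_model` are hypothetical names introduced here. How you detect font size, tables, or columns in an image is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str
    price_per_image: float  # USD per image, as quoted in this report
    failure_rate: float     # fraction of pages that need human review


# Figures from the report; treat them as illustrative, not measured.
MINI = ModelProfile("gpt-4o-mini", 0.005, 0.30)
FULL = ModelProfile("gpt-4o", 0.15, 0.02)


def total_cost(model: ModelProfile, pages: int, review_cost_per_page: float) -> float:
    """API spend plus human review for every page the model gets wrong."""
    api = pages * model.price_per_image
    labor = pages * model.failure_rate * review_cost_per_page
    return api + labor


def break_even_review_cost(cheap: ModelProfile, strong: ModelProfile) -> float:
    """Review cost per failed page above which the stronger model is cheaper all-in."""
    extra_api = strong.price_per_image - cheap.price_per_image
    fewer_failures = cheap.failure_rate - strong.failure_rate
    return extra_api / fewer_failures


def choose_model(
    min_text_pt: Optional[float],  # smallest text size in the image, None if no text
    has_tables: bool = False,
    multi_column: bool = False,
    rotated: bool = False,
    multi_panel: bool = False,
) -> str:
    """Route to 4o for the patterns the report lists as mini failure cases."""
    small_text = min_text_pt is not None and min_text_pt < 12
    if small_text or has_tables or multi_column or rotated or multi_panel:
        return FULL.name
    return MINI.name


if __name__ == "__main__":
    for rate in (0.25, 1.00):
        mini = total_cost(MINI, 10_000, rate)
        full = total_cost(FULL, 10_000, rate)
        print(f"review ${rate:.2f}/page: mini ${mini:,.0f} vs 4o ${full:,.0f}")
    print(f"break-even: ${break_even_review_cost(MINI, FULL):.2f} per failed page")
    print(choose_model(min_text_pt=10))    # terminal screenshot case
    print(choose_model(min_text_pt=None))  # natural photo, no text
```

Keeping the review cost per failed page as an explicit parameter matters because it decides the outcome: with the report's prices and failure rates, the break-even sits near $0.52 per failed page, so cheap review favors mini and expensive review favors 4o.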
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:26:43.115883+00:00— report_created — created