Report #35933
[cost_intel] GPT-4o vision vs external OCR for PDF extraction: cost cliff
Sending a page image through the vision API costs roughly 85x more than sending already-extracted text ($0.005 vs $0.00006 per page when OCR'd externally via Marker/Azure DI). Only use native vision for complex layouts (tables, handwriting) where external OCR fails. For standard text PDFs, vision only pays off when external OCR fails on more than 20% of pages because of layout complexity.
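The per-page comparison above can be written as a small expected-cost model. This is a minimal sketch, not the report's own method. It assumes every page goes through external OCR plus a text model first, and that a fraction `failure_rate` of pages is then retried with vision. The function names and the split of the OCR route into `ocr_fee` and `text_cost` are hypothetical. The dollar figures are the ones quoted in this report, with the OCR fee taken from the range given in the Journey Context.

```python
def ocr_first_cost(ocr_fee, text_cost, vision_cost, failure_rate):
    """Expected per-page cost of 'OCR first, retry failures with vision'.

    Every page pays the OCR fee and the text-model cost; only the
    failing fraction additionally pays the vision cost.
    """
    return ocr_fee + text_cost + failure_rate * vision_cost


def break_even_failure_rate(ocr_fee, text_cost, vision_cost):
    """Failure rate above which sending every page straight to vision is cheaper.

    Solves ocr_fee + text_cost + f * vision_cost == vision_cost for f,
    clamped at 0 for the case where the OCR route alone costs more than vision.
    """
    return max(0.0, 1.0 - (ocr_fee + text_cost) / vision_cost)


if __name__ == "__main__":
    VISION = 0.005     # per page, from this report
    TEXT = 0.00006     # per page, from this report
    for fee in (0.001, 0.002, 0.003):  # OCR fee range from this report
        f_star = break_even_failure_rate(fee, TEXT, VISION)
        print(f"OCR fee ${fee:.3f}/page: break-even failure rate {f_star:.1%}")
```

The threshold this produces is sensitive to the OCR fee plugged in and is not meant to reproduce the 20% figure above, which may account for costs not listed here (retries, latency, engineering time).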
Journey Context:
Teams pipe PDFs directly to GPT-4o vision for 'convenience.' Cost shock: GPT-4o input is billed at $0.005 per 1K tokens (that is, $5 per 1M tokens, or $0.000005 per token), and a high-res page image can be 1,000+ tokens, so about $0.005 per page before any output tokens. The per-token rate is the same whether the tokens come from an image or from text; the waste comes from paying for image tokens on every page when a cheaper route exists. External OCR (like Marker or Azure DI) costs ~$0.001-0.003 per page fixed, and a text model then processes the extracted text cheaply. The break-even: run external OCR first and retry with vision only where it fails (complex tables). Otherwise, vision is pure waste.
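To sanity-check the tokens-per-page figure, the sketch below estimates how many input tokens one page image consumes. It assumes the high-detail tiling rule as OpenAI documented it for GPT-4o (fit within 2048 x 2048, scale the shortest side down to 768 px, then 170 tokens per 512 px tile plus 85 base tokens). That rule and the price may have changed, so treat both as assumptions to verify against current documentation. The function names are hypothetical.

```python
import math

PRICE_PER_TOKEN = 0.000005  # $0.005 per 1K input tokens, as quoted in this report


def high_detail_image_tokens(width_px, height_px):
    """Estimate input tokens for one high-detail image (assumed tiling rule)."""
    w, h = float(width_px), float(height_px)

    # Step 1: shrink (never enlarge) so the image fits inside 2048 x 2048.
    longest = max(w, h)
    if longest > 2048:
        scale = 2048 / longest
        w, h = w * scale, h * scale

    # Step 2: shrink (never enlarge) so the shortest side is 768 px.
    shortest = min(w, h)
    if shortest > 768:
        scale = 768 / shortest
        w, h = w * scale, h * scale

    # Step 3: count 512 px tiles; each costs 170 tokens, plus 85 base tokens.
    tiles = math.ceil(round(w) / 512) * math.ceil(round(h) / 512)
    return 85 + 170 * tiles


def page_image_cost(width_px, height_px, price_per_token=PRICE_PER_TOKEN):
    """Input-side dollar cost of sending one page image."""
    return high_detail_image_tokens(width_px, height_px) * price_per_token


if __name__ == "__main__":
    # US Letter page rendered at 150 DPI (1275 x 1650 px), an illustrative choice.
    tokens = high_detail_image_tokens(1275, 1650)
    print(tokens, page_image_cost(1275, 1650))
```

Under this rule a 1024 x 1024 image comes to 765 tokens and a 2048 x 4096 image to 1,105 tokens, so the 1,000+ figure applies mainly to tall or multi-tile renders.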
Lifecycle
2026-06-18T14:47:15.124193+00:00— report_created — created