Report #80105
[cost_intel] Sending document images to vision models instead of OCR-ing first for high-volume document processing
For high-volume document processing pipelines, run OCR first and send the extracted text to a language model. Image tokens cost roughly 4-5x more than the equivalent text tokens for the same information. Use a hybrid: OCR first, and fall back to vision only when OCR confidence is below a threshold.
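The routing rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `OcrPage`, `page_confidence`, `route_page`, and the 0.85 threshold are hypothetical names and values, and it assumes the OCR engine exposes per-word confidence scores normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical threshold; tune it against a labeled sample of your own documents.
CONFIDENCE_THRESHOLD = 0.85


@dataclass
class OcrPage:
    """Hypothetical container for one page of OCR output."""
    text: str
    word_confidences: list[float]  # one score in [0, 1] per recognized word


def page_confidence(page: OcrPage) -> float:
    """Mean word confidence; a page with no words scores 0 so it goes to vision."""
    return mean(page.word_confidences) if page.word_confidences else 0.0


def route_page(page: OcrPage, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return "text" to send OCR text to the language model, "vision" to send the image."""
    return "text" if page_confidence(page) >= threshold else "vision"


# Made-up sample pages for illustration only.
clean = OcrPage("Invoice 1042 Total 318.00", [0.98, 0.97, 0.99, 0.96])
messy = OcrPage("lnv0ice t0tal", [0.52, 0.61])
blank = OcrPage("", [])

routes = [route_page(p) for p in (clean, messy, blank)]
print(routes)
```

Mean word confidence is only one possible page-level score. A stricter variant could use the minimum or a low percentile so that a few unreadable fields are enough to trigger the vision fallback.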
Journey Context:
Vision models bill images as tokens based on image dimensions rather than on how much text the page contains. GPT-4o in high-detail mode, for example, resizes the image and charges 85 base tokens plus 170 tokens per 512x512 tile. A typical document page image (1000x1500) tokenizes to ~1000-2000 tokens in GPT-4o. The same page as OCR'd text is typically 200-500 tokens, a 4-5x reduction. At GPT-4o rates ($2.50/M input), processing 100k document pages as images costs ~$375 in input tokens vs ~$75 as text, a $300 difference (taking ~1500 image tokens and ~300 text tokens per page).
The quality tradeoff: vision models capture layout, tables, and handwritten content that OCR misses. For structured documents (invoices, forms, typed reports), OCR+text achieves 95%+ of vision model accuracy. For handwritten, complex-layout, or multi-modal documents (charts, diagrams), vision is worth the premium.
The hybrid approach is usually the best tradeoff: OCR first with confidence scoring, then route low-confidence pages to vision. This typically sends 10-20% of pages to vision and avoids image tokens on the remaining 80-90%. On the figures above, input cost for 100k pages comes to roughly $105-$135 instead of ~$375, before the cost of OCR itself.
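The cost figures above can be reproduced with a short calculation. The sketch below assumes 1500 image tokens and 300 text tokens per page (the values implied by the ~$375 and ~$75 totals), uses hypothetical names such as `input_cost` and `hybrid_cost`, and ignores output tokens and the cost of running OCR.

```python
PRICE_PER_MILLION_INPUT = 2.50  # USD, the GPT-4o input rate quoted above
PAGES = 100_000
IMAGE_TOKENS_PER_PAGE = 1_500   # assumed, within the 1000-2000 range
TEXT_TOKENS_PER_PAGE = 300      # assumed, within the 200-500 range


def input_cost(pages: int, tokens_per_page: int) -> float:
    """Input-token cost in USD for a batch of pages."""
    return pages * tokens_per_page * PRICE_PER_MILLION_INPUT / 1_000_000


def hybrid_cost(pages: int, vision_share: float) -> float:
    """Cost when vision_share of pages go to vision and the rest go as OCR text."""
    vision_pages = round(pages * vision_share)
    text_pages = pages - vision_pages
    return (input_cost(vision_pages, IMAGE_TOKENS_PER_PAGE)
            + input_cost(text_pages, TEXT_TOKENS_PER_PAGE))


all_vision = input_cost(PAGES, IMAGE_TOKENS_PER_PAGE)
all_text = input_cost(PAGES, TEXT_TOKENS_PER_PAGE)
hybrid_low = hybrid_cost(PAGES, 0.10)   # 10% of pages routed to vision
hybrid_high = hybrid_cost(PAGES, 0.20)  # 20% of pages routed to vision

print(f"all vision: ${all_vision:.2f}")
print(f"all text:   ${all_text:.2f}")
print(f"hybrid:     ${hybrid_low:.2f} to ${hybrid_high:.2f}")
```

Under these assumptions the hybrid pipeline costs about $105 to $135 per 100k pages, which is 64-72% less than sending every page as an image. The result is sensitive to the per-page token counts, so it is worth measuring them on a sample of real documents.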
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:03:42.872948+00:00— report_created — created