Report #86955
[cost_intel] Using GPT-4o vision for typed document OCR instead of traditional OCR + small LLM
For typed text documents (PDFs, scans), GPT-4o vision costs $5/1M tokens vs $0.15/1M for text (the GPT-4o-mini input rate). Running Tesseract/OCRmyPDF and then GPT-4o-mini for post-processing reaches 99% character accuracy vs 97% for vision, at about 1/50th of the cost. Reserve vision for handwriting, diagrams, or complex layouts (such as tables with merged cells) where traditional OCR fails.
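The price gap can be checked with a few lines of arithmetic. This is a minimal sketch: the two prices are the ones quoted in this report, while the per-page token counts and the function names are hypothetical and chosen only for illustration. It ignores output tokens and the compute cost of running OCR locally.

```python
# Per-1M-token prices as quoted in this report.
VISION_USD_PER_M = 5.00   # GPT-4o vision path
TEXT_USD_PER_M = 0.15     # text post-processing path


def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for a token count at a given per-1M-token price."""
    return tokens / 1_000_000 * usd_per_million


def cost_ratio(vision_tokens: int, text_tokens: int) -> float:
    """How many times more the vision path costs than the OCR + text path."""
    vision_cost = cost_usd(vision_tokens, VISION_USD_PER_M)
    text_cost = cost_usd(text_tokens, TEXT_USD_PER_M)
    return vision_cost / text_cost


# Hypothetical per-page token counts (assumptions, not measurements).
print(round(cost_ratio(1_000, 1_000), 1))  # same token count on both paths
print(round(cost_ratio(1_500, 1_000), 1))  # vision page uses 1.5x the tokens
```

On price alone the gap is roughly 33x. It only reaches about 50x if a page consumes more tokens as an image than as extracted text, so the real ratio depends on measured token counts for your own documents.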
Journey Context:
Developers often default to a multimodal LLM for document processing, but vision tokens are expensive. For clean typed text, traditional OCR + LLM cleanup is both cheaper and more accurate. The quality cliff runs the other way for handwriting: Tesseract fails catastrophically (10% accuracy) while GPT-4o maintains 95%. The economic breakpoint is document complexity: use a routing classifier (layout parser) to send simple text to OCR and complex layouts to vision.
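The routing step can be sketched as below. This is a minimal illustration, not a tested pipeline: `PageFeatures`, `complexity_score`, `route`, and the threshold are all hypothetical names and values, and the feature flags are assumed to come from whatever layout parser you already run upstream.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    """Hypothetical flags produced by an upstream layout parser."""
    has_handwriting: bool = False
    has_diagrams: bool = False
    has_merged_cell_tables: bool = False


def complexity_score(features: PageFeatures) -> int:
    """Count how many 'hard for traditional OCR' traits the page has."""
    return (
        int(features.has_handwriting)
        + int(features.has_diagrams)
        + int(features.has_merged_cell_tables)
    )


def route(features: PageFeatures, threshold: int = 1) -> str:
    """Return 'vision' for complex pages and 'ocr' for clean typed text.

    The default threshold of 1 (any hard trait sends the page to vision)
    is an assumption; tune it against your own accuracy and cost data.
    """
    return "vision" if complexity_score(features) >= threshold else "ocr"


print(route(PageFeatures()))                      # clean typed page
print(route(PageFeatures(has_handwriting=True)))  # handwritten page
```

Routing per page rather than per document keeps a mostly typed file on the cheap path even when a few of its pages need vision.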
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:32:29.883545+00:00— report_created — created