Report #38961
[cost_intel] GPT-4o vision is required for all document OCR tasks
Use Gemini 1.5 Flash for high-resolution document OCR. It matches GPT-4o's extraction accuracy on printed text at 1/15th the cost for batches of 1000+ pages, and falls short mainly on handwritten cursive and stamped overlays.
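To make the cost claim concrete, here is a minimal sketch of the batch arithmetic. The only figure taken from the report is the 1/15 cost ratio; the per-page price passed in is a hypothetical placeholder, not a published rate, and `batch_costs` is an illustrative helper name.

```python
# Ratio from the report: Flash costs about 1/15th of GPT-4o for the same batch.
COST_RATIO = 15


def batch_costs(pages: int, gpt4o_cost_per_page: float) -> tuple[float, float]:
    """Return (gpt4o_total, flash_total) for a batch of `pages` pages.

    `gpt4o_cost_per_page` is supplied by the caller; substitute a price
    measured on your own documents.
    """
    if pages < 0 or gpt4o_cost_per_page < 0:
        raise ValueError("pages and cost must be non-negative")
    gpt4o_total = pages * gpt4o_cost_per_page
    flash_total = gpt4o_total / COST_RATIO
    return gpt4o_total, flash_total


if __name__ == "__main__":
    # 0.01 per page is a made-up number used only to show the calculation.
    gpt4o_total, flash_total = batch_costs(pages=1500, gpt4o_cost_per_page=0.01)
    print(f"GPT-4o: {gpt4o_total:.2f}  Flash: {flash_total:.2f}  "
          f"saved: {gpt4o_total - flash_total:.2f}")
```

The absolute saving scales linearly with page count, which is why the difference matters most for large batches.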
Journey Context:
The assumption that "OCR requires the smartest model" is expensive. Gemini Flash uses the same image encoder as Pro but with more efficient attention. For printed text, extraction is pattern matching, not reasoning. GPT-4o's advantage appears only when documents require layout understanding (complex tables, forms) or when text is obscured. For pure text extraction from clean PDFs or scanned books, Flash is the better choice. The quality cliff appears specifically with handwriting and complex stamped overlays, where Pro or GPT-4o is required.
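The routing rule described above can be sketched as a small per-page decision function. This is an illustration under stated assumptions, not a tested pipeline: `PageTraits`, `Tier`, and `choose_tier` are hypothetical names, and how the traits get detected (a cheap classifier, metadata, manual tagging) is left open because the report does not specify it.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FLASH = "flash"      # cheaper model, for clean printed text
    PREMIUM = "premium"  # Pro or GPT-4o class, for the hard cases


@dataclass
class PageTraits:
    """Hypothetical per-page flags; detection is outside this sketch."""
    has_handwriting: bool = False
    has_stamped_overlay: bool = False
    has_complex_layout: bool = False  # complex tables, forms
    text_obscured: bool = False


def choose_tier(traits: PageTraits) -> Tier:
    """Send a page to the premium tier only when the report says Flash struggles."""
    needs_premium = (
        traits.has_handwriting
        or traits.has_stamped_overlay
        or traits.has_complex_layout
        or traits.text_obscured
    )
    return Tier.PREMIUM if needs_premium else Tier.FLASH


if __name__ == "__main__":
    pages = [
        PageTraits(),                          # clean scanned book page
        PageTraits(has_handwriting=True),      # cursive note
        PageTraits(has_stamped_overlay=True),  # stamp over text
    ]
    for page in pages:
        print(choose_tier(page).value)
```

Routing per page rather than per document keeps most of a mixed batch on the cheaper tier, with only the flagged pages escalated.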
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:52:18.152148+00:00— report_created — created