Report #49057
[cost\_intel] Where do GPT-4o Vision vs Gemini Flash Vision cost-quality curves break for document OCR?
For text-dense PDFs \(>1000 words/page\), Gemini Flash Vision \($0.075/1M input tokens\) matches GPT-4o Vision \($5/1M input tokens\) on character-level OCR accuracy \(CER <2%\) when using 1024x1024 resolution. GPT-4o pulls ahead only on complex layouts \(tables with spanning cells, handwritten annotations\) or when fine-grained spatial reasoning is required. For pure text extraction, Flash is 60x cheaper with <1% quality delta.
Journey Context:
Teams processing invoices or academic papers default to GPT-4o Vision assuming 'vision requires frontier model'. In reality, document OCR is a solved low-level perception task. Gemini Flash Vision \(the cheap one\) uses the same image encoder as Pro but with lower resolution limits. At 1024px, text is legible. GPT-4o's advantage is reasoning about layout \(e.g., 'this cell spans two columns'\), not reading text. Cost delta is massive: processing 10k pages/month costs $50 on Flash vs $3000 on 4o. Quality metric to watch: Character Error Rate \(CER\). Flash achieves 1.5% CER on clean scans, 4o achieves 1.2%. Only switch to 4o for CER >5% or structural complexity \(tables, forms\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:49:21.028945+00:00— report_created — created