Report #75467
[cost_intel] Using GPT-4V for high-volume document OCR where Gemini Flash suffices
For high-volume printed-document OCR (>1000 pages/day), use Gemini 1.5 Flash instead of GPT-4V or Claude 3.5 Sonnet. Flash matches printed-text accuracy (>99%) at roughly 1/33rd the input cost ($0.075/1M tokens vs $2.50/1M for GPT-4V). However, keep a fallback to GPT-4V for handwritten text, low-contrast scans, and complex tables: Flash accuracy drops to 70-80% on degraded inputs, where GPT-4V maintains >95%.
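As a quick sanity check on the price gap, the arithmetic can be sketched as below. This is a minimal sketch, assuming the per-1M-input-token prices quoted in this report (which may be out of date) and a hypothetical default of 1,000 input tokens per page. The dictionary keys are plain labels for illustration, not verified API model identifiers.

```python
# Prices per 1M input tokens as quoted in this report; verify against current pricing.
PRICE_PER_1M_INPUT_TOKENS = {
    "gemini-1.5-flash": 0.075,  # label only, not a verified API model ID
    "gpt-4v": 2.50,             # label only, not a verified API model ID
}


def input_cost_usd(model: str, pages: int, tokens_per_page: int = 1000) -> float:
    """Return the input-token cost in USD for an OCR batch.

    tokens_per_page is an assumption; real image token counts depend on
    resolution and on the provider's tokenization rules.
    """
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS[model]


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the expensive model costs per input token."""
    return PRICE_PER_1M_INPUT_TOKENS[expensive] / PRICE_PER_1M_INPUT_TOKENS[cheap]


if __name__ == "__main__":
    for model in PRICE_PER_1M_INPUT_TOKENS:
        print(f"{model}: ${input_cost_usd(model, pages=10_000):.2f} per 10k pages")
    print(f"ratio: {cost_ratio('gpt-4v', 'gemini-1.5-flash'):.1f}x")
```

Output tokens are not included; for OCR they can be a meaningful share of the bill, so treat the result as a lower bound.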
Journey Context:
Engineers often default to GPT-4V for all vision tasks because of early benchmarks, not realizing that OCR is a narrow task where smaller vision models excel on clean inputs. The cost delta is large: processing 10k pages with GPT-4V at 1k tokens/page (10M tokens) costs $25 in input tokens alone, while Gemini Flash costs $0.75. The quality cliff is sharp: Flash struggles with handwriting, rotated text, and poor lighting, which is exactly where GPT-4V's reasoning helps disambiguate. The pattern: use Flash as a first-pass 'filter' and send a page to GPT-4V only when Flash confidence (available in some API responses) is low or when handwritten text is explicitly detected.
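The filter pattern can be sketched as a small router. This is a minimal sketch rather than a tested integration: `OcrResult`, `route_page`, the `cheap_ocr` / `strong_ocr` callables, the `looks_handwritten` detector, and the 0.90 threshold are all hypothetical names and values introduced for illustration. In practice the callables would wrap the provider clients you actually use, and the confidence value would come from whatever signal the API exposes, if any.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OcrResult:
    text: str
    confidence: Optional[float]  # None when the API exposes no confidence signal
    model: str


# Hypothetical adapter type: takes raw image bytes, returns an OcrResult.
OcrFn = Callable[[bytes], OcrResult]


def route_page(
    image: bytes,
    cheap_ocr: OcrFn,
    strong_ocr: OcrFn,
    looks_handwritten: Callable[[bytes], bool],
    min_confidence: float = 0.90,       # assumed threshold; tune on your own data
    escalate_when_unknown: bool = True,  # what to do if no confidence is returned
) -> OcrResult:
    """Try the cheap model first; fall back to the strong model when needed."""
    # Explicitly detected handwriting skips the cheap model entirely.
    if looks_handwritten(image):
        return strong_ocr(image)

    result = cheap_ocr(image)

    # No confidence signal: escalate or accept, depending on the chosen policy.
    if result.confidence is None:
        return strong_ocr(image) if escalate_when_unknown else result

    # Low confidence: escalate to the strong model.
    if result.confidence < min_confidence:
        return strong_ocr(image)

    return result
```

If the cheap model's API never returns a confidence value, `escalate_when_unknown=True` sends every page to the strong model and removes the savings; in that case a separate quality check (for example, a dictionary-word ratio on the extracted text) would have to supply the signal.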
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:16:28.984929+00:00— report_created — created