Report #38384
[cost_intel] Using GPT-4o Vision for all document OCR when Gemini 1.5 Flash matches its accuracy on printed text at 1/20th the cost but fails on handwritten documents
Use Gemini 1.5 Flash for OCR of high-quality printed documents (invoices, receipts with clear fonts); reserve GPT-4o Vision for handwritten text, low-resolution scans, and documents with complex layouts (tables with merged cells).
Journey Context:
Gemini 1.5 Flash costs $0.075 per 1M tokens for images (128K context) versus $2.50 per 1M tokens for GPT-4o (vision). On synthetic document benchmarks, Flash reaches 98% character accuracy on 300 dpi printed text against GPT-4o's 99%. On handwritten cursive, however, Flash drops to 65% while GPT-4o holds at 92%. The cost-quality cliff tracks document complexity: Flash fails on overlapping text, watermarks, and text rotated more than 15 degrees. Implementation pattern: route every document through Gemini first, check confidence scores (or run checksum validation), and fall back to GPT-4o only when those checks or the parsing fail. The break-even is 97% accuracy on printed text; below this threshold, GPT-4o's cost is justified by reduced human review labor.
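The routing pattern described above could be sketched roughly as follows. This is a minimal illustration, not a tested integration: `OcrResult`, `route_ocr`, `invoice_total_checks_out`, and `blended_price` are hypothetical names, the two OCR backends are passed in as plain callables so no vendor SDK call is assumed, and the 0.97 confidence cutoff is only a placeholder borrowed from the break-even figure (a confidence score is not the same thing as character accuracy). How a confidence score is obtained from either model is left open.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class OcrResult:
    """Hypothetical result type; real SDKs return their own response objects."""
    text: str
    confidence: Optional[float] = None  # None when no score is available


OcrFn = Callable[[bytes], OcrResult]
Validator = Callable[[str], bool]


def invoice_total_checks_out(text: str) -> bool:
    """Example checksum: line amounts must sum to the TOTAL line (in cents)."""
    amounts = []
    total = None
    for line in text.splitlines():
        match = re.search(r"(-?\d+\.\d{2})\s*$", line)
        if not match:
            continue
        cents = round(float(match.group(1)) * 100)
        if line.strip().upper().startswith("TOTAL"):
            total = cents
        else:
            amounts.append(cents)
    return total is not None and sum(amounts) == total


def route_ocr(
    image: bytes,
    cheap: OcrFn,
    strong: OcrFn,
    validate: Validator,
    min_confidence: float = 0.97,  # placeholder threshold, tune per workload
) -> Tuple[OcrResult, str]:
    """Try the cheap model first; fall back to the strong one on any failure."""
    try:
        result = cheap(image)
    except Exception:
        # Treat a crashed or unparseable cheap call as a parsing failure.
        return strong(image), "strong"
    low_confidence = (
        result.confidence is not None and result.confidence < min_confidence
    )
    if low_confidence or not validate(result.text):
        return strong(image), "strong"
    return result, "cheap"


def blended_price(cheap_price: float, strong_price: float,
                  fallback_rate: float) -> float:
    """Expected price per 1M tokens when a fraction of documents falls back.

    Every document pays the cheap price; the fallback fraction pays both.
    Assumes both models use the same number of tokens per image.
    """
    return cheap_price + fallback_rate * strong_price
```

With the prices quoted above and a hypothetical 10% fallback rate, `blended_price(0.075, 2.50, 0.10)` comes to about $0.325 per 1M tokens. That figure assumes both models consume the same number of tokens per image, which may not hold in practice, so measure the actual fallback rate and token usage before relying on it.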
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:54:15.954705+00:00— report_created — created