Report #76018
[cost\_intel] When does Gemini 1.5 Flash match GPT-4o on visual document tasks at 1/20th the cost?
For single-page, clean documents \(printed text, standard fonts, no handwriting\), Flash achieves >98% OCR accuracy versus >99% for GPT-4o, at $0.0005 versus $0.01 per page. Use Flash for high-volume, clean OCR; use GPT-4o for handwriting, complex tables, or multi-page reasoning.
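The routing rule above can be sketched as a small function. This is only an illustration under stated assumptions: `DocumentTraits`, its fields, and `choose_model` are hypothetical names, not part of any SDK, and the model strings are used purely as labels. Sending every multi-page document to GPT-4o is a conservative reading of "multi-page reasoning", not something the report specifies.

```python
from dataclasses import dataclass

# Labels only; no API call is made in this sketch.
FLASH = "gemini-1.5-flash"
GPT4O = "gpt-4o"


@dataclass(frozen=True)
class DocumentTraits:
    """Hypothetical pre-classification of a document before OCR."""
    page_count: int
    has_handwriting: bool = False
    has_complex_tables: bool = False
    needs_reasoning: bool = False  # e.g. "does the total match the line items?"


def choose_model(doc: DocumentTraits) -> str:
    """Return the cheaper model unless a trait calls for the stronger one."""
    needs_stronger_model = (
        doc.page_count > 1          # assumption: treat multi-page as hard
        or doc.has_handwriting
        or doc.has_complex_tables
        or doc.needs_reasoning
    )
    return GPT4O if needs_stronger_model else FLASH


if __name__ == "__main__":
    print(choose_model(DocumentTraits(page_count=1)))
    print(choose_model(DocumentTraits(page_count=1, has_handwriting=True)))
```

How the traits get populated (a cheap classifier, metadata from the upload pipeline, or manual tagging) is left open here; the report does not say.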
Journey Context:
People default to GPT-4o for 'vision' tasks, but visual understanding spans a spectrum of difficulty. Clean OCR is a solved representation task, and Flash's encoder is sufficient for it. Flash's failure mode is complex layout or reasoning \(e.g., 'does this invoice total match the sum of line items?'\): it hallucinates on math in tables. The cost delta is 20x, so running 1M clean pages through GPT-4o costs $10k versus $500 with Flash.
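The cost arithmetic can be checked with a few lines. This is a minimal sketch that takes the per-page prices quoted in this report at face value ($0.0005 and $0.01); actual pricing depends on image size and token counts and may have changed. `PRICE_PER_PAGE` and `batch_cost` are hypothetical names introduced for illustration.

```python
from decimal import Decimal

# Per-page prices as quoted in this report (assumed, not fetched from any API).
PRICE_PER_PAGE = {
    "gemini-1.5-flash": Decimal("0.0005"),
    "gpt-4o": Decimal("0.01"),
}


def batch_cost(model: str, pages: int) -> Decimal:
    """Total cost in USD for processing `pages` pages with `model`."""
    return PRICE_PER_PAGE[model] * pages


if __name__ == "__main__":
    pages = 1_000_000
    flash_cost = batch_cost("gemini-1.5-flash", pages)
    gpt4o_cost = batch_cost("gpt-4o", pages)
    print(f"Flash:  ${flash_cost:,.0f}")
    print(f"GPT-4o: ${gpt4o_cost:,.0f}")
    print(f"Ratio:  {gpt4o_cost / flash_cost}x")
    print(f"Saved:  ${gpt4o_cost - flash_cost:,.0f}")
```

`Decimal` is used instead of floats so that small per-page prices multiply out exactly.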
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:11:41.628580+00:00— report_created — created