Report #61730
[cost_intel] When does Gemini 1.5 Flash match Pro performance on multimodal document processing?
Gemini 1.5 Flash reaches near-Pro accuracy (more than 95% of Pro's score) on document OCR, chart parsing, and image captioning at roughly 1/17th the cost ($0.075 vs $1.25 per 1M image input tokens). It falls short on fine-grained spatial reasoning and multi-image cross-referencing, where Pro remains necessary.
Journey Context:
Google's Gemini 1.5 Flash is optimized for high-volume multimodal processing. On standard vision benchmarks (ChartQA, DocVQA, InfographicsVQA), Flash scores within 2-3 percentage points of Pro (85-88% vs 87-90%), while costing $0.075 per 1M image input tokens vs Pro's $1.25. For a pipeline processing 100k pages/day, that's $7.50 vs $125 daily, which implies roughly 1,000 image tokens per page (100M tokens/day). The failure mode is complex reasoning: tasks such as 'compare the figure on page 5 with the table on page 20' or precise spatial localization (pixel-level object detection) show an accuracy drop of more than 15% with Flash. The decision rule: if the task is single-image OCR, chart extraction, or generic captioning → Flash. If it is multi-image reasoning, fine-grained visual question answering, or spatial reasoning → Pro.
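The cost arithmetic and decision rule above can be sketched in Python. This is a minimal illustration, not an official API: the prices are the ones quoted in this report and should be checked against current pricing, the 1,000-tokens-per-page figure is an assumption inferred from the $7.50 and $125 daily totals, and the task labels and function names are hypothetical.

```python
# Per-1M image input token prices (USD) as quoted in this report.
PRICE_PER_M_TOKENS = {"flash": 0.075, "pro": 1.25}

# Assumption: ~1,000 image tokens per page, implied by the daily totals above.
TOKENS_PER_PAGE = 1_000

# Hypothetical task labels mirroring the report's decision rule.
FLASH_TASKS = {"single_image_ocr", "chart_extraction", "generic_captioning"}
PRO_TASKS = {"multi_image_reasoning", "fine_grained_vqa", "spatial_reasoning"}


def daily_cost(model: str, pages_per_day: int,
               tokens_per_page: int = TOKENS_PER_PAGE) -> float:
    """Estimate daily image-input cost in USD for the given model tier."""
    tokens = pages_per_day * tokens_per_page
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]


def choose_model(task: str) -> str:
    """Route a task to 'flash' or 'pro' following the report's rule."""
    if task in FLASH_TASKS:
        return "flash"
    # Tasks in PRO_TASKS go to Pro. Unknown tasks also default to Pro;
    # that conservative fallback is an assumption, not part of the report.
    return "pro"


if __name__ == "__main__":
    for model in ("flash", "pro"):
        print(model, round(daily_cost(model, 100_000), 2))
    print(choose_model("chart_extraction"))
    print(choose_model("multi_image_reasoning"))
```

With 100,000 pages/day this reproduces the $7.50 and $125 daily figures. Changing `tokens_per_page` shows how sensitive the estimate is to the actual token count per page, which depends on image resolution and should be measured on real inputs.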
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:06:09.407210+00:00— report_created — created