Report #85879
[cost_intel] Using frontier models for high-volume OCR and text extraction
For text-dense document OCR (invoices, receipts, forms), use Gemini 1.5 Flash with JSON mode instead of GPT-4o Vision. Flash matches GPT-4o's character error rate (CER) on printed text at roughly 1/14th the per-image cost ($0.00035 vs $0.005). Reserve GPT-4o for handwritten text, complex charts, or spatial reasoning.
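As a quick sanity check on the cost comparison, here is a minimal sketch that computes batch cost and the price ratio from the per-image figures quoted in this report. The model keys and helper names (`batch_cost`, `cost_ratio`) are illustrative assumptions, and the prices are a snapshot from this report rather than current list prices.

```python
from decimal import Decimal

# Per-image prices in USD as quoted in this report (assumed snapshot, not live pricing).
PRICE_PER_IMAGE = {
    "gemini-1.5-flash": Decimal("0.00035"),
    "gpt-4o": Decimal("0.005"),
}


def batch_cost(model: str, pages: int) -> Decimal:
    """Image-input cost for a batch, assuming one image per page."""
    return PRICE_PER_IMAGE[model] * pages


def cost_ratio(cheap: str, expensive: str) -> Decimal:
    """How many times more the expensive model costs per image."""
    return PRICE_PER_IMAGE[expensive] / PRICE_PER_IMAGE[cheap]


if __name__ == "__main__":
    pages = 10_000  # example volume
    for model in PRICE_PER_IMAGE:
        print(f"{model}: ${batch_cost(model, pages):.2f} for {pages} pages")
    print(f"ratio: {cost_ratio('gemini-1.5-flash', 'gpt-4o'):.1f}x")
```

This covers image input only; output tokens and retries are not modeled, so real bills will be somewhat higher for both models.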
Journey Context:
Teams often assume GPT-4o is 'best' for vision, but for dense text extraction Flash models are comparably accurate and far cheaper. Flash accuracy falls off a cliff with tiny fonts (<8pt), complex layouts (tables within tables), or handwriting. The signal to use a frontier model is a task that requires visual layout understanding (e.g., 'extract the value to the right of the signature line'). Token bloat pattern: GPT-4o often generates a verbose description before the JSON, while Flash with forced JSON mode is terse. Cost math: at the per-image prices above, processing 10k pages costs about $3.50 with Flash vs about $50 with GPT-4o.
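The routing rule described above could be sketched as a small pre-classification step. This is a hypothetical illustration: the `PageTraits` fields and `choose_model` are invented for this example, and how the traits get detected (a cheap classifier, upstream metadata, manual tagging) is left open.

```python
from dataclasses import dataclass

CHEAP_MODEL = "gemini-1.5-flash"
FRONTIER_MODEL = "gpt-4o"


@dataclass
class PageTraits:
    """Hypothetical per-page signals; populate them however your pipeline allows."""
    min_font_pt: float = 10.0
    has_handwriting: bool = False
    has_nested_tables: bool = False
    needs_layout_reasoning: bool = False


def choose_model(traits: PageTraits) -> str:
    """Route to the frontier model only when a known cliff condition is present."""
    if traits.has_handwriting or traits.has_nested_tables:
        return FRONTIER_MODEL
    if traits.needs_layout_reasoning:
        return FRONTIER_MODEL
    if traits.min_font_pt < 8.0:  # tiny-font threshold taken from this report
        return FRONTIER_MODEL
    return CHEAP_MODEL


if __name__ == "__main__":
    print(choose_model(PageTraits()))                      # plain printed invoice
    print(choose_model(PageTraits(has_handwriting=True)))  # handwritten form
```

Defaulting to the cheap model and escalating only on explicit signals keeps the expensive path limited to the pages that need it.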
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:44:09.622463+00:00— report_created — created