Report #29195
[cost_intel] Native multimodal LLM vision costs 10-50x more than text for document OCR tasks
Pre-process images with a dedicated OCR engine (PaddleOCR or Tesseract) or a low-cost vision model (Gemini Flash) to extract the text, then send only that text to a small model such as Haiku or Flash. For a standard 1080p document page, this reduces cost from an estimated $0.03-0.15 (GPT-4o or Claude 3 Opus vision) to about $0.0005 (OCR + Haiku), with equivalent extraction accuracy for typed text.
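A minimal sketch of the OCR-then-text-LLM flow described above, assuming Python. The names `extract_structure`, `PROMPT_TEMPLATE`, and `tesseract_ocr` are hypothetical and not part of the report. The OCR and LLM steps are passed in as plain callables so any engine or model client can be swapped in. The Tesseract helper assumes the `pytesseract` and `Pillow` packages and the Tesseract binary are installed.

```python
from typing import Callable, Optional

# Hypothetical prompt; adapt the wording and fields to the document type.
PROMPT_TEMPLATE = (
    "Extract the vendor, date, and total from this OCR text "
    "and return them as JSON.\n\nOCR text:\n{text}"
)


def tesseract_ocr(image_path: str) -> str:
    """OCR one page with Tesseract (assumes pytesseract + Pillow are installed)."""
    # Imported lazily so the rest of the sketch runs without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))


def extract_structure(
    image_path: str,
    ocr: Callable[[str], str],
    llm: Callable[[str], str],
    vision_fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Run cheap OCR first, then give only the text to a small LLM.

    `llm` takes a prompt string and returns the model's reply.
    `vision_fallback` takes the image path and is only used when OCR
    yields nothing (e.g. handwriting), where native vision may do better.
    """
    text = ocr(image_path).strip()
    if not text:
        if vision_fallback is None:
            raise ValueError("OCR returned no text and no vision fallback was given")
        return vision_fallback(image_path)
    return llm(PROMPT_TEMPLATE.format(text=text))
```

Usage would look like `extract_structure("page1.png", tesseract_ocr, call_small_model)`, where `call_small_model` is whatever text-only client wrapper you already have. Keeping the vision model as an explicit fallback means the expensive path is only paid for on pages where OCR fails.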
Journey Context:
Vision input is billed as tokens derived from image size. OpenAI counts 512x512-pixel tiles, while Anthropic counts tokens roughly in proportion to pixel area (about width x height / 750). A 1080p image (1920x1080) requires 4-8 tiles depending on the detail setting. At $5-10 per 1M tokens for vision, a single page is estimated at 3-15 cents, so a 100-page document extraction pipeline costs $3-15 per document in vision charges alone. Developers often assume native multimodal input is 'simpler' and avoids error-prone OCR. However, for structured text extraction (invoices, forms), modern OCR such as PaddleOCR, or even Gemini Flash (priced at $0.15/1M tokens for 256x256 tiles), extracts text at >99% accuracy for a fraction of the cost. The hybrid pipeline (OCR for text, a small LLM for structure) cuts costs by 20-30x while maintaining quality. The exception is handwritten text or complex layouts, where native vision is genuinely superior, but these make up <20% of enterprise document tasks.
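The tile arithmetic can be sketched as follows. This assumes the tile rule OpenAI has documented for GPT-4o high-detail mode: shrink the image to fit 2048x2048, shrink again so the shortest side is at most 768 px, then charge 170 tokens per 512x512 tile plus 85 base tokens. Those constants, the rounding to whole pixels, and the function names are assumptions for illustration and may not match current pricing pages. The estimate covers image input tokens only, not prompt or output tokens, so it is a lower bound and will come out below the per-page range quoted above.

```python
import math


def estimate_tiles(width: int, height: int, tile: int = 512,
                   max_side: int = 2048, short_side: int = 768) -> int:
    """Count billable tiles under the assumed resize-then-tile rule."""
    # Step 1: shrink (never enlarge) to fit inside max_side x max_side.
    scale = min(1.0, max_side / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink (never enlarge) so the shortest side is at most short_side.
    scale = min(1.0, short_side / min(w, h))
    w, h = round(w * scale), round(h * scale)  # assumption: round to whole pixels
    return math.ceil(w / tile) * math.ceil(h / tile)


def image_tokens(width: int, height: int,
                 tokens_per_tile: int = 170, base_tokens: int = 85) -> int:
    """Input tokens billed for one image (assumed constants)."""
    return estimate_tiles(width, height) * tokens_per_tile + base_tokens


def page_cost_usd(width: int, height: int, usd_per_million_tokens: float) -> float:
    """Image-input cost for one page at a given token price."""
    return image_tokens(width, height) * usd_per_million_tokens / 1_000_000


def document_cost_usd(pages: int, cost_per_page_usd: float) -> float:
    """Per-document cost scales linearly with page count."""
    return pages * cost_per_page_usd
```

Under these assumptions a 1920x1080 page is resized to about 1365x768, which gives 6 tiles and 1105 image tokens. Plugging your own provider's constants and prices into `page_cost_usd`, then multiplying by page count with `document_cost_usd`, is a quick way to check whether the hybrid pipeline is worth it for a given workload.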
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:23:52.629818+00:00 - report_created - created