Report #27500
[cost_intel] When do vision-language model costs dominate text costs in document processing
Use OCR + text LLM for scanned documents with simple layouts; reserve native vision for complex tables, charts, or handwritten text. Rough break-even: one high-res page image costs about as many input tokens as ~3 pages of OCR'd text.
Journey Context:
GPT-4o charges ~170 tokens per 512x512 tile in high-detail mode, plus a fixed 85 tokens per image. A standard 8.5x11 page scanned at 200 DPI (1700x2200 pixels) is downscaled before tiling and ends up as 4-6 tiles (680-1020 tokens, before the fixed charge) for the image alone, versus ~300-500 tokens for the same page as OCR'd text. For simple text-heavy documents, vision therefore costs roughly 2-3x more input tokens with no accuracy benefit. The anti-pattern is sending every PDF as an array of page images 'for accuracy' without any layout analysis. A better pipeline first runs a cheap, specialized layout detection model to classify regions: text blocks go to OCR and then to a text LLM; tables and charts go to a vision LLM. This hybrid approach can cut costs 5-10x on text-heavy PDFs while preserving vision accuracy for complex figures.
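The arithmetic and routing above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The tile rule (fit within 2048x2048, scale the shortest side down to 768, count 512-pixel tiles at 170 tokens each, add 85) follows the published GPT-4o high-detail pricing as understood at the time of writing and should be checked against current documentation. The `Region` type, its `kind` labels, and the `route` and `page_tokens` helpers are hypothetical names introduced only for this example.

```python
import math
from dataclasses import dataclass

TILE_SIZE = 512
TOKENS_PER_TILE = 170  # per-tile figure quoted in this report
BASE_TOKENS = 85       # fixed per-image charge in high-detail mode (assumed current)


def image_tokens(width: float, height: float) -> int:
    """Estimate input tokens for one image sent in high-detail mode."""
    # Step 1: shrink (never enlarge) to fit inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: shrink (never enlarge) so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: count the 512-pixel tiles needed to cover the result.
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


@dataclass
class Region:
    """Hypothetical output of a layout detection model for one page region."""
    kind: str            # e.g. "text", "table", "chart", "handwriting"
    width: int           # region size in pixels
    height: int
    ocr_tokens: int = 0  # token count of the OCR'd text, if any


# Region kinds that go to the vision model (illustrative labels only).
VISION_KINDS = {"table", "chart", "handwriting"}


def route(region: Region) -> str:
    """Pick a pipeline for a region: vision LLM or OCR followed by a text LLM."""
    return "vision_llm" if region.kind in VISION_KINDS else "ocr_text_llm"


def page_tokens(regions: list[Region]) -> int:
    """Estimate input tokens for one page under the hybrid pipeline."""
    total = 0
    for region in regions:
        if route(region) == "vision_llm":
            total += image_tokens(region.width, region.height)
        else:
            total += region.ocr_tokens
    return total


# A full 1700x2200 scan sent as an image, versus the same page as OCR'd text.
full_page_image = image_tokens(1700, 2200)
text_only_page = page_tokens([Region("text", 1700, 2200, ocr_tokens=400)])
```

Under these assumptions, `image_tokens(1700, 2200)` comes to 4 tiles plus the fixed charge, while the text-only page costs only its OCR token count (400 here, a value chosen from the 300-500 range above). Output tokens, OCR compute, and the layout model's own cost are not modeled.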
Lifecycle
2026-06-18T00:33:20.659867+00:00— report_created — created