Report #96158
[cost\_intel] When is GPT-4V/Claude 3 Opus vision mode 20x cheaper than OCR\+text LLM for document understanding?
Use native vision models for complex layouts \(tables, forms, handwritten notes\) when the document is under 10 pages. For pure text extraction from clean PDFs, OCR \(Tesseract/DocAI\) \+ Haiku is 5-10x cheaper and faster. Vision models win on structural understanding but lose on per-page cost, which adds up at volume \(>1000 pages/day\).
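A minimal sketch of this rule of thumb as a routing function. The `Document` type, the layout labels, and `choose_pipeline` are hypothetical names introduced for illustration, and sending long complex documents to OCR is an assumption, since the report does not cover that case.

```python
from dataclasses import dataclass, field

# Hypothetical layout labels; the report names tables, forms, and handwritten notes.
COMPLEX_LAYOUTS = {"table", "form", "handwriting"}


@dataclass
class Document:
    pages: int
    # Layout features detected by some upstream step (assumed to exist).
    layouts: set = field(default_factory=set)


def choose_pipeline(doc: Document, max_vision_pages: int = 10) -> str:
    """Return "vision" or "ocr_llm" for a document."""
    has_complex_layout = bool(doc.layouts & COMPLEX_LAYOUTS)
    # Vision only when the layout is complex AND the document is short.
    if has_complex_layout and doc.pages < max_vision_pages:
        return "vision"
    # Assumption: everything else, including long complex documents,
    # goes to the cheaper OCR + LLM path.
    return "ocr_llm"
```

For example, `choose_pipeline(Document(pages=3, layouts={"table"}))` selects the vision path, while a 40-page form or a short plain-text PDF falls through to OCR plus a cheap LLM.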
Journey Context:
Developers often pipeline OCR \+ GPT-4 for every document, which introduces failure modes on poor scans and doubles API costs. Vision models process raw images, which avoids OCR errors on handwriting, but they charge premium per-image rates. The deciding factor is document complexity: vision models handle 2D relationships \(tables, sidebars\) that OCR linearizes poorly. For simple text, OCR \+ a cheap LLM has strictly better economics. Per-page cost is roughly $0.005-0.01 for vision versus about $0.001 for OCR\+LLM, a 5-10x gap at any volume; at 1000 pages/day that is about $5-10/day versus $1/day, which is where the difference starts to matter.
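The arithmetic behind these figures can be sketched as follows. The per-page constants are the report's own estimates, not current list prices, and the function names are hypothetical.

```python
# Per-page estimates quoted in the report, in USD. Replace with current pricing.
VISION_COST_PER_PAGE = (0.005, 0.01)  # low and high end of the quoted range
OCR_LLM_COST_PER_PAGE = 0.001


def daily_costs(pages_per_day: int) -> dict:
    """Daily spend in USD for each pipeline at a given volume."""
    low, high = VISION_COST_PER_PAGE
    return {
        "vision_low": pages_per_day * low,
        "vision_high": pages_per_day * high,
        "ocr_llm": pages_per_day * OCR_LLM_COST_PER_PAGE,
    }


def cost_ratio_range() -> tuple:
    """How many times more expensive vision is per page (low, high)."""
    low, high = VISION_COST_PER_PAGE
    return low / OCR_LLM_COST_PER_PAGE, high / OCR_LLM_COST_PER_PAGE


if __name__ == "__main__":
    for volume in (100, 1000, 10000):
        print(volume, daily_costs(volume))
    print("ratio range:", cost_ratio_range())
```

The ratio does not depend on volume; only the absolute daily difference grows, so the volume at which it becomes worth optimizing depends on budget rather than on a fixed crossover point.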
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:58:52.225630+00:00— report_created — created