Report #54237
[cost_intel] Vision models for text-heavy document parsing
Use OCR (Tesseract, Marker) or a layout-aware extractor (LayoutLM) for text-heavy PDFs; reserve vision models such as GPT-4o or Claude 3 Opus for charts, diagrams, handwriting, or complex layouts where OCR fails. Sending a page as an image costs roughly 10-20x as much as sending its extracted text (OpenAI: $5 per 1M text tokens vs. $5-15 per 1M image tokens, depending on resolution). For a 100-page document, vision costs about $5-10 vs. about $0.20 for OCR + LLM text extraction.
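One way to apply this rule is to route each page individually: try OCR first and escalate to a vision model only when the page has visual elements or the OCR output looks unreliable. The sketch below is illustrative only. `PageStats`, `choose_extractor`, and both thresholds are hypothetical names and values, not part of any OCR library, and the statistics are assumed to come from whatever OCR step is already in the pipeline.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page signals, assumed to be produced by an earlier OCR pass."""
    ocr_char_count: int          # characters of text OCR recovered
    ocr_mean_confidence: float   # 0.0-1.0, averaged over recognized words
    has_visual_elements: bool    # charts, diagrams, handwriting, signatures


def choose_extractor(page: PageStats,
                     min_chars: int = 200,
                     min_confidence: float = 0.80) -> str:
    """Return "ocr" for clean text pages, "vision" otherwise.

    The thresholds are placeholder assumptions; tune them on real documents.
    """
    if page.has_visual_elements:
        return "vision"
    # Very little text, or low confidence, suggests OCR returned garbage.
    if page.ocr_char_count < min_chars:
        return "vision"
    if page.ocr_mean_confidence < min_confidence:
        return "vision"
    return "ocr"


def route_document(pages: list[PageStats]) -> list[str]:
    """Pick an extractor for every page of a document."""
    return [choose_extractor(p) for p in pages]


if __name__ == "__main__":
    doc = [
        PageStats(3200, 0.96, False),  # ordinary text page
        PageStats(2800, 0.94, True),   # text page with an embedded chart
        PageStats(40, 0.35, False),    # scan that OCR could not read
    ]
    print(route_document(doc))
```

Routing per page rather than per document keeps the expensive path limited to the pages that actually need it.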
Journey Context:
Teams conflate 'document understanding' with 'vision reasoning.' Business documents are primarily text, so running them through a vision model is overkill. Vision LLMs process images as grids of patches (e.g., a 1024x1024 image is about 765 tokens under OpenAI's high-detail tiling), so a 10-page PDF at high resolution costs ~7k tokens per page. OCR extracts the text at near-zero cost, and a cheap LLM (e.g., Haiku) structures it. Vision is only justified for visual elements (signatures, charts, redacted text) where OCR returns garbage. For text-heavy workflows, the cost difference is up to about 50x.
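The arithmetic behind these figures can be checked with a few lines of Python. This is a rough sketch that counts input tokens only. The ~7k image tokens per page and the $5-15 per 1M price range come from the report; the 400 text tokens per page is an assumption chosen to match the report's $0.20 figure, and OCR itself is treated as free.

```python
def input_cost_usd(pages: int, tokens_per_page: int,
                   usd_per_million_tokens: float) -> float:
    """Input-token cost of sending `pages` pages to a model."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


PAGES = 100
VISION_TOKENS_PER_PAGE = 7_000  # from the report (high-resolution page image)
TEXT_TOKENS_PER_PAGE = 400      # assumption: OCR text of one business page

# Vision path: image tokens at the low and high ends of the quoted price range.
vision_low = input_cost_usd(PAGES, VISION_TOKENS_PER_PAGE, 5.0)
vision_high = input_cost_usd(PAGES, VISION_TOKENS_PER_PAGE, 15.0)

# OCR path: only the extracted text is sent to the LLM.
text_cost = input_cost_usd(PAGES, TEXT_TOKENS_PER_PAGE, 5.0)

if __name__ == "__main__":
    print(f"vision:  ${vision_low:.2f} to ${vision_high:.2f}")
    print(f"ocr+llm: ${text_cost:.2f}")
    print(f"ratio:   {vision_low / text_cost:.1f}x to {vision_high / text_cost:.1f}x")
```

Under these assumptions the vision path comes to about $3.50-$10.50 per 100 pages against about $0.20 for OCR plus text, a ratio of roughly 17x to 52x. Output tokens and any per-request overhead would shift both sides.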
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:32:02.396860+00:00— report_created — created