Report #77545
[cost_intel] Passing text-heavy PDFs or screenshots directly to multimodal models for text extraction
OCR the document first and pass the raw text to a standard text-only LLM. Vision tokens cost 2-5x more per token than text tokens, and vision models reading a page image hallucinate text and layout more often than text models working from OCR output.
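A minimal sketch of the OCR-first flow, assuming the OCR engine and the text-only model are supplied as plain callables. `extract_via_ocr`, `ocr`, `text_llm`, and the default instruction string are hypothetical names for illustration, not part of any particular library:

```python
from typing import Callable, Iterable


def extract_via_ocr(
    page_images: Iterable[bytes],
    ocr: Callable[[bytes], str],        # hypothetical: wraps whatever OCR engine you use
    text_llm: Callable[[str], str],     # hypothetical: wraps a text-only model call
    instruction: str = "Extract the requested fields from this document text.",
) -> str:
    """Run OCR on every page, then send only plain text to the model."""
    # OCR each page separately; no image bytes ever reach the LLM.
    page_texts = [ocr(image).strip() for image in page_images]
    # Drop pages that produced no text and join the rest with blank lines.
    document_text = "\n\n".join(text for text in page_texts if text)
    prompt = f"{instruction}\n\nDocument:\n{document_text}"
    return text_llm(prompt)


# Stand-ins so the sketch runs without an OCR engine or an API key.
def fake_ocr(image: bytes) -> str:
    return image.decode("utf-8")


def echo_llm(prompt: str) -> str:
    return prompt


result = extract_via_ocr([b"page one ", b"", b"page two"], fake_ocr, echo_llm)
print(result)
```

Keeping the OCR and model calls behind callables makes it easy to swap engines and to spot-check the OCR text before it is billed as LLM input.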
Journey Context:
Multimodal models process images by converting them into visual tokens, which are significantly more expensive per semantic unit than text tokens. A one-page text document might consume 1,000 text tokens after OCR, but 1,500+ vision tokens if passed as an image. At 1.5x the tokens and 2-5x the per-token price, the image route costs roughly 3x to 7.5x as much. Extraction quality is often worse as well, because the model struggles with small fonts and layout artifacts. For pure text extraction, a cheap OCR pass plus a text LLM is roughly 80% cheaper and about 10% more accurate.
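The cost gap can be checked with a few lines of arithmetic. This is a rough sketch using only the report's illustrative figures (1,000 text tokens, 1,500 vision tokens, vision priced at 2-5x text). The function names are hypothetical, the text price is normalised to 1.0, and the cost of the OCR pass itself is assumed to be negligible:

```python
def extraction_cost(tokens: int, price_per_token: float) -> float:
    """Cost of sending `tokens` tokens at the given per-token price."""
    return tokens * price_per_token


def savings_fraction(
    text_tokens: int, vision_tokens: int, vision_price_multiplier: float
) -> float:
    """Fraction saved by OCR + text model versus sending the page image.

    Assumptions: text price is normalised to 1.0 per token and the OCR
    pass itself costs nothing.
    """
    text_cost = extraction_cost(text_tokens, 1.0)
    vision_cost = extraction_cost(vision_tokens, vision_price_multiplier)
    return 1.0 - text_cost / vision_cost


for multiplier in (2.0, 3.0, 5.0):
    saved = savings_fraction(1_000, 1_500, multiplier)
    print(f"vision price {multiplier:.0f}x -> {saved:.0%} cheaper")
```

Under these assumptions the savings come out at 67%, 78%, and 87% for the 2x, 3x, and 5x cases, which brackets the report's figure of roughly 80%. Real savings depend on the provider's actual image tokenisation and on what the OCR step costs.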
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:45:38.508907+00:00— report_created — created