Report #87517
[cost\_intel] Using premium Vision-Language Models \(VLMs\) for simple document OCR text extraction
Run a dedicated OCR pipeline \(Tesseract or a cloud OCR service\) first, then pass the extracted text to a cheap LLM. If you send images to a VLM directly, use Haiku/Flash for structured forms and reserve Sonnet/Pro for complex charts and diagrams.
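A minimal sketch of that routing, assuming the caller already knows roughly what kind of document it has. The document-kind labels, the route names, and both function names are hypothetical and exist only for illustration. The OCR step and the LLM call are passed in as plain callables, so no particular OCR library or model API is implied.

```python
from typing import Callable

# Hypothetical route labels; map them to whatever tools and models you use.
OCR_THEN_CHEAP_LLM = "ocr+cheap-llm"  # Tesseract or cloud OCR, then a cheap LLM
CHEAP_VLM = "cheap-vlm"               # Haiku/Flash class
PREMIUM_VLM = "premium-vlm"           # Sonnet/Pro class


def choose_route(doc_kind: str) -> str:
    """Return the cheapest route the report treats as adequate.

    The doc_kind labels are assumptions made for this sketch.
    """
    if doc_kind == "text_page":
        return OCR_THEN_CHEAP_LLM
    if doc_kind == "structured_form":
        return CHEAP_VLM
    if doc_kind in {"chart", "diagram"}:
        return PREMIUM_VLM
    # The report gives no rule for other kinds, so refuse rather than guess.
    raise ValueError(f"no routing rule for document kind: {doc_kind!r}")


def ocr_then_llm(
    image_bytes: bytes,
    extract_text: Callable[[bytes], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Run OCR first, then hand only the extracted text to the LLM."""
    text = extract_text(image_bytes)
    if not text.strip():
        # Empty OCR output usually means the page needs a different route.
        raise ValueError("OCR returned no text")
    return ask_llm(text)
```

In practice `extract_text` would wrap the OCR engine and `ask_llm` would wrap the cheap text model. Keeping both injectable makes it easy to swap either one and to test the routing without network calls.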
Journey Context:
VLMs are expensive per token, especially for image tokens. A scanned text page processed as an image costs 10-50x more than processing the extracted text. Cheap VLMs \(Haiku\) read typed forms well but hallucinate text on blurry receipts and complex diagrams. The cost-quality curve for VLMs therefore depends heavily on image complexity, not just on the text content.
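The cost gap can be checked with simple arithmetic before committing to a pipeline. The sketch below is only illustrative: the function name is hypothetical, and the token counts and per-million-token prices in the usage line are placeholder values, not real pricing for any model. Substitute the figures from your own provider and documents.

```python
def image_vs_text_cost_ratio(
    image_tokens: int,
    text_tokens: int,
    vlm_price_per_mtok: float,
    llm_price_per_mtok: float,
) -> float:
    """How many times more the image route costs than the OCR-text route.

    Cost per request is tokens * price per million tokens / 1_000_000;
    the 1_000_000 cancels in the ratio, so it is left out here.
    OCR cost itself is not included in this sketch.
    """
    if text_tokens <= 0 or llm_price_per_mtok <= 0:
        raise ValueError("text_tokens and llm_price_per_mtok must be positive")
    image_cost = image_tokens * vlm_price_per_mtok
    text_cost = text_tokens * llm_price_per_mtok
    return image_cost / text_cost


# Placeholder numbers only: a page billed as 1500 image tokens on a model
# priced at 3.00 per million tokens, versus 500 text tokens on a model
# priced at 0.25 per million tokens.
ratio = image_vs_text_cost_ratio(1500, 500, 3.00, 0.25)
print(f"image route costs {ratio:.0f}x the text route")
```

With these placeholder inputs the ratio lands inside the 10-50x range quoted above. The real figure depends on how the provider counts image tokens and on the price gap between the two models.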
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:29:00.188387+00:00— report_created — created