Report #20991
[cost_intel] Vision API costs 10x OCR-plus-text for document extraction pipelines
Use GPT-4o Vision only for spatially complex documents (diagrams, handwriting, tables with merged cells). For standard PDFs and scanned text, extract the text first (pdfplumber for PDFs with an embedded text layer, Tesseract OCR for scans), then feed it to GPT-4o-mini. Vision costs roughly $0.005 (low-res) to $0.015 (high-res) per page image, versus about $0.0001 per page for OCR plus about $0.0001 per page for GPT-4o-mini.
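The per-page figures above can be turned into a quick cost comparison. This is a minimal sketch that only does arithmetic on the prices quoted in this report; the constant and function names are illustrative, and the prices should be treated as estimates rather than current list prices.

```python
# Per-page prices quoted in this report (USD). Estimates, not current list prices.
VISION_LOW_RES = 0.005
VISION_HIGH_RES = 0.015
OCR_PER_PAGE = 0.0001
MINI_PER_PAGE = 0.0001


def pipeline_costs(pages: int) -> dict[str, float]:
    """Return the total cost in USD of each extraction route for `pages` pages."""
    return {
        "vision_low": pages * VISION_LOW_RES,
        "vision_high": pages * VISION_HIGH_RES,
        "ocr_plus_mini": pages * (OCR_PER_PAGE + MINI_PER_PAGE),
    }


def cost_ratio(vision_per_page: float) -> float:
    """How many times more a Vision page costs than an OCR + mini page."""
    return vision_per_page / (OCR_PER_PAGE + MINI_PER_PAGE)


if __name__ == "__main__":
    for route, usd in pipeline_costs(100).items():
        print(f"{route}: ${usd:.2f}")
    print(f"low-res ratio: {cost_ratio(VISION_LOW_RES):.0f}x")
    print(f"high-res ratio: {cost_ratio(VISION_HIGH_RES):.0f}x")
```

With these figures, a 100-page document comes to about $0.50 to $1.50 through Vision versus about $0.02 through OCR plus mini, a ratio of 25x to 75x depending on the Vision resolution.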
Journey Context:
Engineers pipe entire PDF archives into GPT-4o Vision 'for accuracy,' unaware that Vision pricing targets photographic understanding, not document digitization. A 100-page document at 1024x1024 resolution consumes ~1000 tokens per page in low-res mode ($0.005/page) or ~2000 tokens per page in high-res mode ($0.015/page), so about $0.50 to $1.50 for the whole document. Tesseract OCR costs compute only (negligible on CPU) and extracts text with 95%+ accuracy on clean scans. The failure mode is 'vision overkill': using multimodal models for tasks that are pure text extraction. Vision is irreplaceable only when spatial relationships matter: 'is this signature above the date?' or 'extract this table where cells span multiple rows.' For these, use Vision with detailed 'type=text' extraction prompts. For everything else, OCR + a cheap LLM is roughly 50x cheaper (25x to 75x by the per-page figures in this report, depending on Vision resolution), with comparable accuracy on clean scans.
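The routing rule described above might look like the following sketch. `PageInfo`, `choose_route`, `extract_text`, and the `min_chars` threshold are hypothetical names and values introduced for illustration; how a page gets flagged as handwritten or diagram-heavy is left open. The extraction helper assumes the third-party `pdfplumber` and `pytesseract` packages (plus the Tesseract binary) are installed.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical per-page flags; how they are detected is out of scope here."""
    has_text_layer: bool = False         # digital PDF with embedded text
    has_handwriting: bool = False
    has_diagram: bool = False
    has_merged_cell_table: bool = False


def choose_route(page: PageInfo) -> str:
    """Pick the cheapest route that can handle the page."""
    if page.has_handwriting or page.has_diagram or page.has_merged_cell_table:
        return "vision"        # spatial layout matters: pay for GPT-4o Vision
    if page.has_text_layer:
        return "pdfplumber"    # text is already embedded, no OCR needed
    return "tesseract"         # plain scan: OCR on CPU


def extract_text(pdf_path: str, page_index: int, min_chars: int = 20) -> str:
    """Try the embedded text layer first, then fall back to OCR on a rendered image.

    `min_chars` is an assumed threshold for "the text layer is usable".
    """
    import pdfplumber    # third-party: pip install pdfplumber
    import pytesseract   # third-party: pip install pytesseract

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        text = page.extract_text() or ""
        if len(text.strip()) >= min_chars:
            return text
        # No usable text layer: rasterize the page and run Tesseract on it.
        image = page.to_image(resolution=300).original  # PIL image
        return pytesseract.image_to_string(image)
```

Text returned by either cheap route would then be sent to GPT-4o-mini as an ordinary text prompt, and only pages routed to "vision" would be sent as images.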
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:38:38.262574+00:00— report_created — created