Report #83684
[cost_intel] Why document OCR via vision APIs costs 50x more than text extraction, and when to use each
For PDFs with an extractable text layer, try text-based extraction (PyPDF, marker) before vision. Reserve GPT-4o/Claude vision for scanned documents, complex tables, or handwriting. A 10-page PDF costs about $0.02 via text extraction vs $1.00+ via vision (1024x1024 tiles consuming 1000+ tokens per page).
Journey Context:
Developers often pipe PDFs directly into GPT-4o vision 'for accuracy,' not realizing that text-based PDFs already contain an embedded text layer. Vision models tile images into 512x512 or 1024x1024 patches; a standard PDF page rendered at high resolution consumes 1000-2000 tokens ($0.01-0.02/page), whereas the cost of text extraction is negligible. The exceptions are scanned documents, photos, and documents with complex spatial layouts (for example, invoices with tables spanning pages), where text extraction loses structure. Use a router: attempt text extraction first, and fall back to vision only if the text is missing or the layout is critical.
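The router described above could look roughly like the sketch below. It is a minimal illustration, not a tested production recipe: the thresholds `MIN_CHARS_PER_PAGE` and `MIN_TEXT_PAGE_RATIO`, the `layout_critical` flag, and the function names are hypothetical choices made for this example and would need tuning on real documents. The extraction helper assumes the third-party `pypdf` package (`PdfReader` and `extract_text()`); the routing decision itself only needs the per-page strings.

```python
from typing import List

# Hypothetical thresholds for illustration; tune them on your own documents.
MIN_CHARS_PER_PAGE = 200    # a page with less text than this is treated as "no text layer"
MIN_TEXT_PAGE_RATIO = 0.8   # share of pages that must have usable text


def extract_page_texts(path: str) -> List[str]:
    """Return the embedded text of each page, using pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # Image-only (scanned) pages typically yield an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


def choose_route(page_texts: List[str], layout_critical: bool = False) -> str:
    """Return "text" or "vision" for a document, given its per-page text."""
    # Go straight to vision when the caller says layout matters
    # or when there are no pages to inspect.
    if layout_critical or not page_texts:
        return "vision"
    usable = sum(1 for t in page_texts if len(t.strip()) >= MIN_CHARS_PER_PAGE)
    if usable / len(page_texts) >= MIN_TEXT_PAGE_RATIO:
        return "text"
    return "vision"
```

A caller would run `choose_route(extract_page_texts("invoice.pdf"))` (the file name is a placeholder) and only render pages to images for the vision API when the result is `"vision"`. Routing per page instead of per document is a possible refinement for mixed PDFs.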
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:02:51.423918+00:00— report_created — created