Report #77149
[cost\_intel] Vision API cost explosion for document processing vs OCR
Pre-extract text from PDFs before LLM ingestion, using a text-layer parser \(pdfplumber\) for digital PDFs or OCR-based tools \(Tesseract, Marker\) for scanned ones; reserve GPT-4o Vision or Gemini 1.5 Flash for pages containing charts, diagrams, or handwritten content. Text extraction costs roughly $0.001/page vs $0.01-0.05/page with vision APIs.
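To make the per-page arithmetic concrete, here is a minimal sketch that compares sending every page to a vision model against extracting text from every page and sending only the visual pages to vision. The function names and default rates are illustrative assumptions; the rates are the rough estimates quoted in this report, not measured or published prices.

```python
# Rough per-page rates taken from this report's estimates (assumptions, in USD).
EXTRACT_RATE = 0.001   # text extraction, per page
VISION_RATE = 0.01     # vision API, per page (the report gives a 0.01-0.05 range)


def all_vision_cost(total_pages: int, vision_rate: float = VISION_RATE) -> float:
    """Cost of sending every page to a vision model."""
    return total_pages * vision_rate


def routed_cost(
    total_pages: int,
    visual_pages: int,
    extract_rate: float = EXTRACT_RATE,
    vision_rate: float = VISION_RATE,
) -> float:
    """Cost when every page is text-extracted and only visual pages also go to vision."""
    if not 0 <= visual_pages <= total_pages:
        raise ValueError("visual_pages must be between 0 and total_pages")
    return total_pages * extract_rate + visual_pages * vision_rate


def savings_fraction(
    total_pages: int,
    visual_pages: int,
    extract_rate: float = EXTRACT_RATE,
    vision_rate: float = VISION_RATE,
) -> float:
    """Fraction of the all-vision cost that routing avoids (0.9 means 90% cheaper)."""
    baseline = all_vision_cost(total_pages, vision_rate)
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    routed = routed_cost(total_pages, visual_pages, extract_rate, vision_rate)
    return 1 - routed / baseline
```

Under these assumed rates, a 200-page document with 4 visual pages at $0.03 per vision page comes to about $0.32 routed versus $6.00 all-vision. The savings shrink as the share of visual pages grows, so the result depends heavily on the document mix.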
Journey Context:
Developers often send entire PDFs as image sequences to vision models \(GPT-4o Vision, Gemini 1.5 Flash\) for "better understanding," assuming text extraction loses formatting. Vision pricing is based on image tiles \(512x512 patches\) or per-image rates, so cost scales with page count and resolution rather than with the amount of text. A single high-resolution PDF page costs roughly $0.005-0.015 per image via vision, versus about $0.0001 per page for text extraction with pdfplumber or Marker: a gap of 10-50x or more. The trap is using vision for text-heavy documents "just in case" there is a diagram on page 47. The fix is conditional routing: extract text from every page first, then send a page to vision only if OCR confidence is low or explicit image content is detected. For text-heavy documents this can reduce processing costs by 90%\+.
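The conditional routing described above could be sketched as follows. This is a minimal illustration, not a tested pipeline: `PageStats`, `needs_vision`, `route_pages`, `stats_from_pdf`, and both thresholds are hypothetical names and values chosen for the example. The only library calls used are pdfplumber's `open`, `pages`, `extract_text`, and `images`, and pdfplumber is imported lazily so the routing logic works without it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical thresholds; tune them against your own documents.
MIN_TEXT_CHARS = 200        # fewer extracted characters suggests a scan or a figure page
MIN_OCR_CONFIDENCE = 60.0   # assumed 0-100 scale, as reported by engines such as Tesseract


@dataclass
class PageStats:
    number: int                             # 1-based page number
    text_chars: int                         # characters recovered by text extraction
    image_count: int                        # embedded images detected on the page
    ocr_confidence: Optional[float] = None  # None when no OCR pass was run


def needs_vision(page: PageStats) -> bool:
    """Return True if the page should be sent to a vision model."""
    if page.image_count > 0:
        return True  # explicit image content detected
    if page.ocr_confidence is not None and page.ocr_confidence < MIN_OCR_CONFIDENCE:
        return True  # OCR ran but its output is not trustworthy
    return page.text_chars < MIN_TEXT_CHARS  # almost no text recovered


def route_pages(pages: List[PageStats]) -> Dict[str, List[int]]:
    """Split page numbers into a cheap text route and an expensive vision route."""
    routes: Dict[str, List[int]] = {"text": [], "vision": []}
    for page in pages:
        routes["vision" if needs_vision(page) else "text"].append(page.number)
    return routes


def stats_from_pdf(path: str) -> List[PageStats]:
    """Collect per-page stats with pdfplumber (requires `pip install pdfplumber`)."""
    import pdfplumber  # lazy import: only needed when reading a real PDF

    with pdfplumber.open(path) as pdf:
        return [
            PageStats(
                number=i,
                text_chars=len(page.extract_text() or ""),
                image_count=len(page.images),
            )
            for i, page in enumerate(pdf.pages, start=1)
        ]
```

A caller would run `route_pages(stats_from_pdf("report.pdf"))` and send only the `"vision"` page numbers to the vision model. Note that the image check is deliberately blunt: a logo in a page header would also trigger the vision route, so a real pipeline would probably filter images by size or position.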
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:05:16.860179+00:00— report_created — created