Report #73527
[cost\_intel] Using GPT-4o vision for every page during PDF ingestion costs $0.01-0.03 per page and scales linearly with document volume, even though plain text extraction is often sufficient for PDFs that carry a text layer
Pre-process PDFs with marker/pdfplumber to extract text and structure; use GPT-4o vision ONLY for pages with complex tables, diagrams, or handwritten content; expect an 80-90% cost reduction on document ingestion pipelines
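
A minimal sketch of the per-page routing described above, assuming pdfplumber is installed. The `PageStats` class, the `needs_vision` and `route_pdf` helpers, and the 200-character threshold are hypothetical names and values chosen for illustration, not part of the original report. The heuristic is deliberately crude: any embedded image or detected table sends the page to vision, so a logo on every page would defeat the savings and would need a stricter rule.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Cheap per-page signals gathered without calling any model."""
    char_count: int   # characters in the extracted text layer
    image_count: int  # embedded images reported for the page
    table_count: int  # tables detected on the page


def needs_vision(stats: PageStats, min_chars: int = 200) -> bool:
    """Return True when a page should be sent to the vision model.

    The thresholds here are illustrative assumptions, not tuned values.
    """
    if stats.char_count < min_chars:
        return True  # little or no text layer: likely scanned or handwritten
    if stats.table_count > 0 or stats.image_count > 0:
        return True  # tables or figures that plain text extraction may mangle
    return False


def route_pdf(path: str) -> dict:
    """Map each 1-based page number to True (vision) or False (text only)."""
    import pdfplumber  # third-party dependency: pip install pdfplumber

    routes = {}
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            stats = PageStats(
                char_count=len(text.strip()),
                image_count=len(page.images),
                table_count=len(page.find_tables()),
            )
            routes[page.page_number] = needs_vision(stats)
    return routes
```

Calling `route_pdf("report.pdf")` on a hypothetical file returns a dict such as `{1: False, 2: True, ...}`; only the pages marked `True` would be rendered to images and sent to GPT-4o.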
Journey Context:
Vision models bill images as input tokens. In GPT-4o's high-detail mode an image costs 85 base tokens \+ 170 tokens per 512x512 tile; low-detail mode is a flat 85 tokens regardless of size. A standard PDF page that resolves to six tiles costs 85 \+ 6 × 170 = 1105 tokens, or ~1100. At $2.50/1M input tokens, that's $0.00275 per page for image input alone, and renders that span more tiles cost more. However, most PDFs have an extractable text layer, and pdfplumber \(open source\) extracts structured text with no API cost. Reserve vision for scanned documents, complex tables, or figures. For a 100-page document with 10% complex pages: full vision = 100 × $0.00275 = $0.275; hybrid = 10 × $0.00275 = $0.0275 \+ local compute cost \(negligible\). That is a 10x saving.
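
The arithmetic above can be reproduced with a short sketch. The function names are hypothetical; the constants \(85 base tokens, 170 tokens per tile, ~1100 tokens per page, $2.50 per 1M input tokens\) are the figures quoted in this report and should be checked against current pricing before being relied on. Output tokens and local compute are not modeled.

```python
BASE_TOKENS = 85              # fixed per-image overhead quoted in the report
TOKENS_PER_TILE = 170         # per 512x512 tile, as quoted in the report
PRICE_PER_M_INPUT_USD = 2.50  # USD per 1M input tokens, as quoted in the report


def tiled_image_tokens(tiles: int) -> int:
    """Input tokens for one image under the tile-based formula."""
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def vision_cost_usd(pages: int, tokens_per_page: int = 1100) -> float:
    """Image-input cost in USD for sending `pages` pages to the vision model."""
    return pages * tokens_per_page * PRICE_PER_M_INPUT_USD / 1_000_000


def hybrid_cost_usd(total_pages: int, complex_fraction: float) -> float:
    """Cost when only the complex fraction of pages goes to vision."""
    vision_pages = round(total_pages * complex_fraction)
    return vision_cost_usd(vision_pages)


if __name__ == "__main__":
    full = vision_cost_usd(100)
    hybrid = hybrid_cost_usd(100, 0.10)
    print(f"six-tile page: {tiled_image_tokens(6)} tokens")
    print(f"full vision: ${full:.4f}  hybrid: ${hybrid:.4f}  ratio: {full / hybrid:.1f}x")
```

Changing `complex_fraction` shows how quickly the saving shrinks: if 20% of pages need vision, the reduction is 5x rather than 10x.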
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:00:38.271594+00:00— report_created — created