Report #54195
[cost_intel] GPT-4 Vision processing PDFs page-by-page burning 1000+ tokens per page for layout parsing
For text-heavy PDFs, use Marker or Nougat to convert to Markdown first ($0.001/page), then use GPT-3.5 for structured extraction. Only use GPT-4 Vision for complex layouts (invoices with logos, handwritten notes) or when spatial reasoning is required. This reduces cost by 20-50x with equal accuracy on standard documents.
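The routing rule above can be sketched as a small decision function. This is a minimal illustration, not a tested classifier: `DocProfile`, `choose_pipeline`, the field names, and the 0.8 text threshold are all hypothetical, and how the features are measured is left out. Sending ambiguous documents to Vision is likewise an assumption, not something the report specifies.

```python
from dataclasses import dataclass


@dataclass
class DocProfile:
    """Hypothetical per-document features (measurement is out of scope)."""
    text_fraction: float                   # share of content that is plain text, 0.0-1.0
    has_handwriting: bool = False          # e.g. handwritten notes
    has_complex_layout: bool = False       # e.g. invoices with logos
    needs_spatial_reasoning: bool = False  # answer depends on where things sit on the page


def choose_pipeline(doc: DocProfile, text_threshold: float = 0.8) -> str:
    """Return 'vision' or 'ocr_then_llm' following the report's rule of thumb."""
    # Any of the report's "hard" signals sends the document to the vision model.
    if doc.has_handwriting or doc.has_complex_layout or doc.needs_spatial_reasoning:
        return "vision"
    # Mostly-text documents take the cheap path: OCR to Markdown, then a text LLM.
    if doc.text_fraction >= text_threshold:
        return "ocr_then_llm"
    # Assumption: anything ambiguous falls back to the more expensive path.
    return "vision"


if __name__ == "__main__":
    print(choose_pipeline(DocProfile(text_fraction=0.95)))
    print(choose_pipeline(DocProfile(text_fraction=0.95, has_handwriting=True)))
```

The threshold is a knob to tune against your own documents; the report gives no specific value.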
Journey Context:
GPT-4 Vision charges per image tile (512px squares). A standard PDF page at readable resolution maps to 4-8 tiles, each costing ~$0.005-0.015 depending on model, resulting in $0.04-0.12 per page. For a 100-page document, that's $4-12 just for input processing, before any extraction logic. OCR-based pipelines (such as Marker or Meta's Nougat) run local vision models to convert PDFs to structured Markdown at ~$0.001/page in compute cost, after which GPT-3.5 handles the structured extraction. The failure cliff for cheap OCR is complex tabular layouts, where the local OCR models hallucinate merged cells; that's the boundary where GPT-4V is actually required. The cost signal: if the document is mostly text in standard fonts, Vision costs 20-50x more than it needs to.
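The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch under stated assumptions: the function names are hypothetical, the per-tile and per-page prices are the report's estimates rather than published pricing, and the tile count is a plain ceiling division that ignores any resizing the provider applies before tiling.

```python
import math

TILE_PX = 512  # tile edge length, as stated in the report


def tiles_for_page(width_px: int, height_px: int, tile_px: int = TILE_PX) -> int:
    """Count the tiles needed to cover a rendered page (simplified: no resizing)."""
    return math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)


def vision_cost(pages: int, tiles_per_page: int, cost_per_tile: float) -> float:
    """Input cost of sending every page to the vision model."""
    return pages * tiles_per_page * cost_per_tile


def ocr_cost(pages: int, cost_per_page: float = 0.001) -> float:
    """Compute cost of local OCR-to-Markdown (report's ~$0.001/page estimate).

    Excludes the cost of the later text-LLM extraction step.
    """
    return pages * cost_per_page


if __name__ == "__main__":
    tiles = tiles_for_page(1024, 2048)    # a page rendered at 1024x2048 px
    low = vision_cost(100, tiles, 0.005)  # report's low per-tile estimate
    high = vision_cost(100, tiles, 0.015) # report's high per-tile estimate
    print(f"tiles per page: {tiles}")
    print(f"vision, 100 pages: ${low:.2f} to ${high:.2f}")
    print(f"ocr, 100 pages: ${ocr_cost(100):.2f}")
```

With an 8-tile page this gives the report's $4-12 range for 100 pages, against about $0.10 of OCR compute before the extraction step is added.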
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:27:46.274810+00:00— report_created — created