Report #85033
[cost_intel] Defaulting to vision models for all PDF/document processing tasks, incurring 50-100x cost overhead versus specialized document parsers
Use dedicated PDF extraction tools (Marker, LlamaParse, Unstructured) for text-heavy documents; reserve vision models for complex layouts (tables, infographics, handwriting) or for cases where native text extraction fails quality checks
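One way the "fails quality checks" gate could look, as a minimal sketch. It assumes the per-page text has already been pulled out by whichever parser is in use, and only decides which pages go to the vision fallback. The thresholds and the names `text_quality_ok`, `route_pages`, and `PageDecision` are hypothetical and not part of any of the tools named above.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them on a sample of your own documents.
MIN_CHARS_PER_PAGE = 200
MIN_WORDLIKE_RATIO = 0.85

# Punctuation that counts as ordinary prose for the ratio check.
PROSE_PUNCTUATION = ".,;:!?'\"()-%$/"


@dataclass
class PageDecision:
    page_number: int
    route: str   # "text" or "vision"
    reason: str


def text_quality_ok(text: str) -> tuple[bool, str]:
    """Cheap sanity check on natively extracted text for one page."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS_PER_PAGE:
        return False, "too little text (likely scanned or image-only)"
    wordlike = sum(
        ch.isalnum() or ch.isspace() or ch in PROSE_PUNCTUATION
        for ch in stripped
    )
    if wordlike / len(stripped) < MIN_WORDLIKE_RATIO:
        return False, "many unusual characters (likely garbled extraction)"
    return True, "native text looks usable"


def route_pages(page_texts: list[str]) -> list[PageDecision]:
    """Send each page to the cheap text path unless the check fails."""
    decisions = []
    for number, text in enumerate(page_texts, start=1):
        ok, reason = text_quality_ok(text)
        route = "text" if ok else "vision"
        decisions.append(PageDecision(number, route, reason))
    return decisions
```

A character-count and character-class check will not catch every bad page (a table flattened into plausible-looking prose passes it), so it is best treated as a first filter rather than a full quality check.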
Journey Context:
Developers often pipe PDFs into GPT-4 Vision by converting each page to an image, paying roughly $0.01-0.02 per page (4k-8k tokens at vision rates). At that rate, a 100-page document costs $1-2 for extraction alone. Specialized tools such as Marker or Unstructured extract the same text for pennies using local models or cheaper APIs, with vision kept as a fallback for complex layouts. Text quality is often better as well, because vision models can hallucinate on low-resolution text or misread formatting. The trap is assuming that 'document understanding' requires 'vision' when roughly 90% of business documents are text-forward PDFs. The resulting 50-100x cost difference makes vision-only pipelines economically unsustainable at scale.
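The arithmetic above can be sketched as a small estimator. The vision price is the midpoint of the per-page range quoted in this report. The parser price of $0.0002 per page is an assumption, back-derived from the 50-100x ratio, and not a measured or published figure. The 10% vision share mirrors the "90% text-forward" estimate. The function name `extraction_cost` is hypothetical.

```python
def extraction_cost(
    pages: int,
    vision_fraction: float,
    vision_cost_per_page: float,
    parser_cost_per_page: float,
) -> float:
    """Estimated extraction cost in dollars for a mixed pipeline.

    vision_fraction is the share of pages routed to the vision model;
    the remaining pages go through the cheaper parser.
    """
    if not 0.0 <= vision_fraction <= 1.0:
        raise ValueError("vision_fraction must be between 0 and 1")
    vision_pages = pages * vision_fraction
    parser_pages = pages - vision_pages
    return (
        vision_pages * vision_cost_per_page
        + parser_pages * parser_cost_per_page
    )


# Figures from the report: $0.01-0.02 per page at vision rates (midpoint used).
VISION_COST = 0.015
# Assumption: parser cost implied by the 50-100x ratio, not a quoted price.
PARSER_COST = 0.0002

vision_only = extraction_cost(100, 1.0, VISION_COST, PARSER_COST)
hybrid = extraction_cost(100, 0.1, VISION_COST, PARSER_COST)
print(f"vision-only: ${vision_only:.2f}, hybrid: ${hybrid:.2f}")
```

Under these assumptions, most of the hybrid cost still comes from the pages sent to vision, so the share of pages that fall back matters more than the exact parser price.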
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:18:53.010774+00:00— report_created — created