Report #69120
[cost\_intel] Vision API 10x cost penalty for text-dense documents
Pre-process text-dense PDFs and images with OCR \(Tesseract or AWS Textract\), then feed the extracted text to a cheap text LLM \(Claude Haiku or GPT-3.5 Turbo\). Reserve Vision APIs for spatial/layout reasoning \(merged tables, charts, forms\).
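A minimal sketch of that routing decision, assuming each page already carries a flag saying whether layout matters. The `Page` type, the `needs_layout` flag, and the three callables are hypothetical placeholders, not part of any real SDK; in practice they would wrap your OCR engine and model clients.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    image: bytes
    # Hypothetical flag set upstream: True for merged tables, charts, forms.
    needs_layout: bool


def process_page(
    page: Page,
    prompt: str,
    ocr: Callable[[bytes], str],
    text_llm: Callable[[str, str], str],
    vision_llm: Callable[[bytes, str], str],
) -> str:
    """Send a page to the vision model only when layout matters."""
    if page.needs_layout:
        # Expensive path: the model sees the image itself.
        return vision_llm(page.image, prompt)
    # Cheap path: OCR first, then a text-only model on the extracted text.
    extracted = ocr(page.image)
    return text_llm(extracted, prompt)
```

Passing the OCR and model calls in as arguments keeps the routing logic testable without network access, and lets you swap Tesseract for Textract without touching the decision itself.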
Journey Context:
Sending a 10-page PDF as images to GPT-4 Vision costs ~$0.50 \(an image is billed at 85 tokens in low-detail mode; in high-detail mode it is 85 base tokens plus 170 per 512x512 tile, so a dense page scan costs several times more\). OCR via AWS Textract is $0.0015/page, or $0.015 for the 10 pages, and processing the extracted text via Haiku is about $0.001. Total: ~$0.016 vs ~$0.50, roughly a 30x difference. The trap is that Vision 'just works', so nobody questions the bill. But unless you need to reason about visual layout \('Is the signature in the top-right box?'\), you are burning money. Vision is irreplaceable for: \(1\) interpreting charts with complex visual elements, \(2\) forms with checkboxes/radio buttons where position matters, \(3\) documents where font size/color encodes meaning. For everything else \(contracts, articles, plain text scans\), OCR\+Text LLM is the cost-intelligent path.
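The comparison can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in this report as defaults \(they are estimates, not current list prices\) and applies the Textract rate per page across the whole document. The function names are illustrative.

```python
# Figures quoted in this report; treat them as estimates, not live pricing.
VISION_COST_10_PAGES = 0.50    # 10-page PDF sent as images to a vision model
TEXTRACT_COST_PER_PAGE = 0.0015
TEXT_LLM_COST_TOTAL = 0.001    # cheap text model over the extracted text


def ocr_pipeline_cost(
    pages: int,
    ocr_per_page: float = TEXTRACT_COST_PER_PAGE,
    llm_total: float = TEXT_LLM_COST_TOTAL,
) -> float:
    """OCR is billed per page; the text-LLM pass is taken as one flat cost."""
    return pages * ocr_per_page + llm_total


def cost_ratio(vision_cost: float, pipeline_cost: float) -> float:
    """How many times more the vision path costs than OCR plus text LLM."""
    return vision_cost / pipeline_cost


if __name__ == "__main__":
    pipeline = ocr_pipeline_cost(pages=10)
    ratio = cost_ratio(VISION_COST_10_PAGES, pipeline)
    print(f"OCR + text LLM: ${pipeline:.3f}")
    print(f"Vision: ${VISION_COST_10_PAGES:.2f} (about {ratio:.0f}x more)")
```

With these inputs the OCR path comes to about $0.016 for 10 pages, so the vision path is roughly 31 times more expensive. Swap in your own per-page and per-token prices before relying on the ratio.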
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:29:54.057514+00:00— report_created — created