Report #72524
[cost_intel] Cost-quality tradeoff of GPT-4V vision models vs OCR preprocessing for document understanding
Use vision models (GPT-4V, Claude 3) only for documents with complex layouts: tables, charts, handwriting, or multi-column pages. For text-dense PDFs, run OCR (AWS Textract, Tesseract) and pass the extracted text to Claude 3 Haiku, at about 1/20th the cost ($0.005 vs $0.10 per page). Vision costs scale with image size (images are billed in 512px tiles), while OCR is billed at a flat per-page rate.
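The routing rule above can be sketched as a small per-page decision. This is a minimal illustration, not a tested pipeline: the `Page` type, the feature labels, and the upstream layout detector that would produce them are all hypothetical, and the pipeline names are placeholders.

```python
from dataclasses import dataclass

# Layout features the report names as reasons to pay for a vision model.
# The label strings are hypothetical; use whatever your layout detector emits.
COMPLEX_FEATURES = {"table", "chart", "handwriting", "multi_column"}


@dataclass(frozen=True)
class Page:
    number: int
    features: frozenset  # labels from an assumed upstream layout detector


def choose_pipeline(page: Page) -> str:
    """Return 'vision' for complex layouts, 'ocr_llm' for text-dense pages."""
    return "vision" if page.features & COMPLEX_FEATURES else "ocr_llm"


pages = [
    Page(1, frozenset()),
    Page(2, frozenset({"table"})),
    Page(3, frozenset({"footnote"})),
]
routes = {p.number: choose_pipeline(p) for p in pages}
print(routes)
```

Routing per page rather than per document means a mostly text-dense PDF with a few tables only pays the vision rate on the pages that need it.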
Journey Context:
Vision models process images as tokens, in 512x512 tiles. A standard page rendered at 1024x1024 is 4 tiles, or ~2,000 tokens, which costs about $0.07 with GPT-4V. OCR costs $0.001-0.002 per page. The quality gap is spatial: a vision model understands layout relationships ('Is this signature above the date?'), while OCR output loses them. For standard text extraction, however, OCR plus a small LLM for cleanup achieves 98% accuracy at 5% of the cost. Common error: sending 1,000-page document batches to vision APIs and generating $500+ bills when OCR + Haiku would cost $25.
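To make the scaling concrete, the tile arithmetic can be sketched as follows. The rates are the report's quoted figures treated as adjustable assumptions, the per-tile rate assumes cost is linear in tile count, and the sketch ignores any resizing a provider may apply before tiling as well as output-token charges.

```python
import math

TILE_PX = 512  # tile edge length quoted in the report

# Assumed rates, derived from the report's figures; adjust to current pricing.
VISION_COST_PER_TILE = 0.07 / 4  # report: about $0.07 for a 4-tile page
OCR_LLM_COST_PER_PAGE = 0.005    # report: flat OCR + small-LLM cost per page


def tiles_for(width_px: int, height_px: int) -> int:
    """Number of 512px tiles needed to cover an image (no resizing modeled)."""
    return math.ceil(width_px / TILE_PX) * math.ceil(height_px / TILE_PX)


def vision_cost_per_page(width_px: int, height_px: int) -> float:
    """Vision cost for one page, assuming cost is linear in tile count."""
    return tiles_for(width_px, height_px) * VISION_COST_PER_TILE


for side in (1024, 2048):
    print(
        f"{side}x{side}: {tiles_for(side, side)} tiles, "
        f"vision ${vision_cost_per_page(side, side):.2f}, "
        f"OCR + LLM ${OCR_LLM_COST_PER_PAGE:.3f}"
    )
```

Doubling the page side from 1024 to 2048 pixels quadruples the tile count (4 to 16) and, under the linear assumption, the vision cost, while the OCR figure does not change with resolution.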
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:19:11.202504+00:00— report_created — created