Report #87639
[cost\_intel] When does GPT-4o Vision cost 5-10x more than OCR preprocessing for document extraction?
For dense text documents \(>1000 words\), run OCR first \(e.g. Tesseract\) and send the extracted text to a text model \(e.g. Claude\); use Vision only for spatial/layout-critical tasks \(forms, diagrams, UI elements\) or when text covers <20% of the image area.
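A minimal sketch of the rule above as a routing function. The function name, parameter names, and return labels are hypothetical; only the 1000-word and 20% thresholds come from this report, and cases the rule does not cover are returned as unspecified rather than guessed.

```python
def choose_pipeline(word_count: int, text_area_fraction: float,
                    layout_critical: bool) -> str:
    """Route a document page per the report's heuristic (illustrative only)."""
    # Vision when layout carries meaning or text is under 20% of the image.
    if layout_critical or text_area_fraction < 0.20:
        return "vision"
    # Dense text: OCR first, then a text model.
    if word_count > 1000:
        return "ocr"
    # The report gives no rule for short, text-heavy pages.
    return "unspecified"


print(choose_pipeline(2500, 0.85, False))
print(choose_pipeline(120, 0.10, False))
```

How `word_count` and `text_area_fraction` are estimated, for example from a cheap first OCR pass, is left open here.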
Journey Context:
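GPT-4o Vision bills images in tokens: a flat 85 tokens in low-detail mode, or, in high-detail mode, 85 base tokens plus 170 tokens per 512x512 tile after the image is scaled to fit within 2048x2048 and its shortest side is reduced to 768px. A 1080p screenshot in high detail costs ~1100 input tokens \($0.0033\) vs OCR output of 500 text tokens \($0.0015\), both at the roughly $3 per million input tokens these figures imply. However, for complex tables, OCR fails on merged cells where Vision succeeds. The trap is sending 100-page document screenshots to Vision for 'accuracy', paying a reported $0.33/page \($33 total\) when OCR\+text model costs $0.50 total. That per-page figure assumes far more than one 1080p image per page: at ~1100 tokens each, a single screenshot per page comes to about $0.33 for all 100 pages. The discriminator is text density: Vision excels when layout carries semantics \(invoices with logos in corners\), but is about 5x overpriced for pure text extraction.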
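The token arithmetic above can be sketched as follows. This follows the published GPT-4o tiling rule as I understand it and assumes images are only ever scaled down; the function names are hypothetical, and the $3 per million tokens rate is an assumption back-derived from the figures in this report, not a quoted price.

```python
import math

BASE_TOKENS = 85   # flat cost in low detail, base cost in high detail
TILE_TOKENS = 170  # per 512x512 tile in high detail


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (assumes no upscaling)."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: shrink to fit inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512px tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


def input_cost_usd(tokens: int, usd_per_million: float = 3.0) -> float:
    """Cost at an ASSUMED input price; check current pricing before use."""
    return tokens * usd_per_million / 1_000_000


page_tokens = gpt4o_image_tokens(1920, 1080)
print(page_tokens, input_cost_usd(page_tokens))   # one 1080p screenshot
print(input_cost_usd(page_tokens * 100))          # 100 pages, one image each
print(input_cost_usd(500 * 100))                  # 100 pages of OCR text
```

Output tokens, OCR compute, and retries are not included, so this compares input-side cost only.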
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:41:23.590457+00:00— report_created — created