Report #39013
[cost\_intel] When does GPT-4o Vision beat OCR \+ text LLM on the document extraction cost-accuracy curve?
Use GPT-4o Vision directly for handwritten documents, complex tables, or mixed layouts; use Tesseract/AWS Textract \+ GPT-4o-mini for clean typed text to save 60-70% cost with minimal accuracy loss.
Journey Context:
GPT-4o Vision costs roughly $0.005-0.015 per page in image tokens, plus output tokens. OCR services cost $0.001-0.002 per page, plus GPT-4o-mini text costs \($0.60/1M output tokens\). For clean typed text, OCR \+ GPT-4o-mini achieves 98% of Vision's accuracy at about 30% of the cost. On handwritten notes, however, Vision achieves 90%\+ accuracy while OCR \+ LLM drops to 50-60%, because transcription errors in the OCR step carry through to the LLM. On complex tables, Vision achieves 98% versus 75% for OCR \+ LLM. The hard-won insight is the hybrid approach: run OCR first with confidence scoring and fall back to Vision only on low-confidence pages. This matches Vision-only accuracy at about 55% of Vision-only cost.
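The hybrid routing described above can be sketched as follows. This is a minimal illustration, not a tested pipeline. The names `PageOcr`, `route_page`, and `blended_cost_per_page` are hypothetical, as is the 0.85 confidence threshold. The per-page costs are assumed values picked from the ranges in this report, and the OCR confidence is assumed to be normalized to 0.0-1.0 (many OCR engines report 0-100).

```python
from dataclasses import dataclass
from typing import List

# Assumed per-page costs in USD, picked from the ranges quoted in this report.
VISION_COST = 0.010      # assumed midpoint of ~$0.005-0.015 for GPT-4o Vision
OCR_COST = 0.0015        # assumed midpoint of $0.001-0.002 for the OCR service
MINI_TEXT_COST = 0.0015  # assumed GPT-4o-mini cost per page of OCR text

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cutoff; tune on labeled pages


@dataclass
class PageOcr:
    page_id: str
    text: str
    mean_confidence: float  # assumed normalized to 0.0-1.0


def route_page(page: PageOcr, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return 'ocr_llm' when the OCR output looks reliable, else 'vision'."""
    return "ocr_llm" if page.mean_confidence >= threshold else "vision"


def blended_cost_per_page(pages: List[PageOcr],
                          threshold: float = CONFIDENCE_THRESHOLD) -> float:
    """Average cost per page for the hybrid pipeline.

    Every page pays for OCR. Confident pages then pay for the text LLM;
    low-confidence pages pay for Vision instead.
    """
    if not pages:
        return 0.0
    total = 0.0
    for page in pages:
        total += OCR_COST
        if route_page(page, threshold) == "ocr_llm":
            total += MINI_TEXT_COST
        else:
            total += VISION_COST
    return total / len(pages)


if __name__ == "__main__":
    sample = [
        PageOcr("typed-invoice", "Invoice 1042 ...", 0.97),
        PageOcr("handwritten-note", "Ca1l b@ck ...", 0.41),
    ]
    for page in sample:
        print(page.page_id, "->", route_page(page))
    ratio = blended_cost_per_page(sample) / VISION_COST
    print(f"hybrid cost as a share of Vision-only: {ratio:.0%}")
```

Under these assumed costs, the blended cost works out to `0.003 + f * 0.0085` per page, where `f` is the share of pages that fall back to Vision. A fallback rate of about 30% therefore lands near the 55% figure quoted above. The real ratio depends on actual token counts and on how many pages in a given corpus fall below the threshold.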
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:57:28.576976+00:00— report_created — created