Report #88952
[cost_intel] Vision API vs OCR+Text chain for document layout understanding
Use GPT-4o Vision directly for documents with complex layouts (tables, multi-column forms). When layout parsing is required, it costs about one third as much per page as an OCR+Sonnet chain ($0.005 vs ~$0.015/page). Do not use it for handwriting below roughly 10pt equivalent: Vision fails there, while OCR+text achieves 95% accuracy.
Journey Context:
The standard pipeline runs OCR (Tesseract or AWS Textract) and then passes the extracted text to a text LLM. For a 10-page document with tables, OCR extracts the text but destroys the table structure, so you either write custom layout-parsing code or send the raw text to Sonnet and have it infer the structure ($12/1M tokens). GPT-4o Vision processes the page image directly and preserves spatial relationships, at $5/1M input tokens. At an average of 1k tokens per page, Vision costs $0.005/page, versus ~$0.015/page for OCR service fees plus Sonnet processing. However, Vision fails on dense handwriting (<10pt font equivalent): OCR+text with spelling correction achieves 95% accuracy there, against a 60% character error rate for Vision. Decision rule: if the document contains tables, forms, infographics, or mixed layouts, use Vision; if it is dense text, a historical manuscript, or handwriting, use OCR+text.
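The per-page arithmetic and the decision rule above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation. The prices and the 1k tokens/page figure are the ones quoted in this report and may not match current pricing. `OCR_FEE_USD_PER_PAGE` is a hypothetical value backed out from the ~$0.015/page total, since the report does not state the OCR fee separately. The names `DocTraits` and `choose_pipeline`, the precedence of handwriting over layout, and the fallback to OCR+text when neither trait applies are all assumptions made for the example.

```python
from dataclasses import dataclass

# Figures quoted in the report; treat them as a snapshot, not current pricing.
VISION_USD_PER_M_TOKENS = 5.0
SONNET_USD_PER_M_TOKENS = 12.0
TOKENS_PER_PAGE = 1_000

# Hypothetical: not stated in the report. Chosen so the chain totals ~$0.015/page.
OCR_FEE_USD_PER_PAGE = 0.003


def llm_cost_per_page(usd_per_m_tokens: float,
                      tokens_per_page: int = TOKENS_PER_PAGE) -> float:
    """Cost of sending one page's worth of tokens to a model."""
    return usd_per_m_tokens * tokens_per_page / 1_000_000


def vision_cost_per_page() -> float:
    """Vision path: the page image goes straight to the model."""
    return llm_cost_per_page(VISION_USD_PER_M_TOKENS)


def chain_cost_per_page() -> float:
    """OCR+text path: OCR service fee plus Sonnet processing of the text."""
    return OCR_FEE_USD_PER_PAGE + llm_cost_per_page(SONNET_USD_PER_M_TOKENS)


@dataclass(frozen=True)
class DocTraits:
    # Tables, forms, infographics, or mixed layouts.
    has_complex_layout: bool
    # Dense text, historical manuscripts, or handwriting.
    is_handwritten_or_dense: bool


def choose_pipeline(traits: DocTraits) -> str:
    """Apply the report's decision rule.

    Assumption: handwriting/dense text takes precedence when both traits are
    present, because that is where the report says Vision fails. The report
    does not cover the case where neither trait applies; this sketch falls
    back to the standard OCR+text pipeline.
    """
    if traits.is_handwritten_or_dense:
        return "ocr+text"
    if traits.has_complex_layout:
        return "vision"
    return "ocr+text"


if __name__ == "__main__":
    print(f"Vision: ${vision_cost_per_page():.3f}/page")
    print(f"OCR+Sonnet: ${chain_cost_per_page():.3f}/page")
    print(choose_pipeline(DocTraits(has_complex_layout=True,
                                    is_handwritten_or_dense=False)))
```

Keeping the prices as named constants makes it easy to rerun the comparison when rates or the tokens-per-page estimate change, which is where the 3x gap is most sensitive.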
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:53:42.684269+00:00— report_created — created