Report #83103
[cost_intel] Why does adding vision capability to document processing silently 20x the cost compared to text OCR?
For text-dense documents (>80% text coverage), extract the text with OCR (Azure Document Intelligence or Textract) and feed it to a text-only LLM. Avoid GPT-4o vision for document processing unless layout or spatial reasoning is critical: vision and text input tokens are billed at the same rate ($0.0025 per 1K), but a page consumes 1000+ vision tokens versus roughly 500 text tokens after OCR.
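The routing rule above can be sketched as a small function. This is a minimal sketch, not a tested pipeline: `choose_pipeline`, its parameters, and the returned labels are hypothetical names. The report does not say how text coverage is measured, nor what to do with low-coverage pages that need no spatial reasoning, so that last branch is an assumption.

```python
TEXT_DENSE_THRESHOLD = 0.80  # ">80% text coverage" figure from the report


def choose_pipeline(text_coverage: float, needs_spatial_reasoning: bool) -> str:
    """Pick a processing path for one document (hypothetical helper).

    text_coverage: fraction of the page area covered by text, 0.0 to 1.0
                   (how this is measured is an assumption, not in the report).
    needs_spatial_reasoning: True when layout, charts, or handwriting matter.
    """
    if not 0.0 <= text_coverage <= 1.0:
        raise ValueError("text_coverage must be a fraction between 0 and 1")
    if needs_spatial_reasoning:
        # The report's only stated case for paying the vision token premium.
        return "vision"
    if text_coverage > TEXT_DENSE_THRESHOLD:
        # Text-dense page: OCR first, then a text-only LLM.
        return "ocr_then_text_llm"
    # Assumption: the report is silent on this case, so flag it for a human.
    return "manual_review"


if __name__ == "__main__":
    print(choose_pipeline(0.92, needs_spatial_reasoning=False))
    print(choose_pipeline(0.92, needs_spatial_reasoning=True))
```

Checking the spatial-reasoning flag first keeps the vision path an explicit opt-in, so a high coverage score can never route a document to vision by accident.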
Journey Context:
GPT-4o charges $2.50 per 1M input tokens for both text and vision, but the tokenization differs drastically: a standard document page at 300 DPI converts to 1024x1024 image tiles, consuming ~1000-2000 vision tokens per page. The same page, once OCR'd, yields ~500-1000 text tokens. That is a 2-4x token count multiplier at an identical per-token price. The gap is compounded because vision prompts often need multiple pages per request (e.g., "compare page 1 and page 2"), which multiplies the per-request token count and quickly hits context limits. Only use vision when spatial relationships (tables, charts, handwriting) are essential and cannot be parsed by layout-aware OCR (like Azure DI).
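The arithmetic behind these figures can be checked with a short script. This is a rough sketch using only the token ranges and the $2.50 per 1M rate quoted in this report. The function name and the 10,000-page batch size are illustrative assumptions, and the OCR service's own per-page fee is not included because the report gives no figure for it.

```python
PRICE_PER_1M_INPUT_TOKENS = 2.50  # USD, GPT-4o input rate quoted in the report

# Per-page token ranges quoted in the report (low, high).
VISION_TOKENS_PER_PAGE = (1000, 2000)
TEXT_TOKENS_PER_PAGE = (500, 1000)


def input_cost(tokens_per_page: int, pages: int) -> float:
    """USD input cost for `pages` pages at a given token count per page."""
    return tokens_per_page * pages * PRICE_PER_1M_INPUT_TOKENS / 1_000_000


if __name__ == "__main__":
    pages = 10_000  # illustrative batch size, not from the report
    for label, (low, high) in (
        ("vision", VISION_TOKENS_PER_PAGE),
        ("ocr+text", TEXT_TOKENS_PER_PAGE),
    ):
        print(f"{label}: ${input_cost(low, pages):.2f} to ${input_cost(high, pages):.2f}")

    # Like-for-like multiplier (low vs low) and worst case (high vs low).
    print("like-for-like:", VISION_TOKENS_PER_PAGE[0] / TEXT_TOKENS_PER_PAGE[0])
    print("worst case:", VISION_TOKENS_PER_PAGE[1] / TEXT_TOKENS_PER_PAGE[0])
```

Because the per-token price is the same on both paths, the cost ratio equals the token ratio, so any reduction in tokens per page translates directly into savings.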
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:04:36.649540+00:00— report_created — created