Report #93329
[cost_intel] Sending high-res images directly to GPT-4V/Claude Vision
Pre-process text-heavy documents with OCR (Tesseract or Amazon Textract) and send the extracted text to an LLM; sending each page as an image to a vision model can cost an order of magnitude more per document than an OCR + LLM pipeline.
Journey Context:
Vision models bill each image as a token equivalent that does not depend on how much text the page contains: 85 tokens in OpenAI's low-detail mode, and roughly 1,000-1,505 tokens for Anthropic. Taking 850 tokens per page image as a working figure at $0.01/1k tokens, a page costs 850 × $0.01/1k = $0.0085, so a 10-page PDF costs about $0.085. OCR with Tesseract (free) or Textract ($0.001/page, so $0.01 for 10 pages) extracts the text, and sending about 3k text tokens for the document to Haiku costs about $0.003. Total: $0.003 (Tesseract) to $0.013 (Textract) per document vs $0.085 for vision, roughly 6.5x to 28x cheaper. Reserve vision for spatial or layout-critical tasks (diagrams, charts, handwriting, form field positioning) where text extraction loses structural information. For standard forms, tables, and printed text, OCR + LLM is usually close in accuracy at a fraction of the cost.
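The per-document comparison is easy to get wrong by hand, so here is a minimal sketch that recomputes it from the inputs quoted above (850 image tokens/page, $0.01/1k vision tokens, $0.001/page for Textract, 3k text tokens costing $0.003). The `CostInputs` class and function names are hypothetical, the 3k text tokens are assumed to cover the whole document, and the prices are the report's figures, not current vendor pricing; substitute your own before relying on the ratio.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CostInputs:
    """Inputs taken from the report; all are assumptions to re-check."""
    pages: int = 10
    image_tokens_per_page: int = 850         # working figure from the report
    vision_usd_per_1k_tokens: float = 0.01   # vision model input price
    ocr_usd_per_page: float = 0.001          # Textract figure; 0.0 for Tesseract
    text_tokens_per_doc: int = 3_000         # assumed to be per document
    text_usd_per_1k_tokens: float = 0.001    # implied by $0.003 for 3k tokens


def vision_cost(c: CostInputs) -> float:
    """USD to send every page as an image to a vision model."""
    return c.pages * c.image_tokens_per_page * c.vision_usd_per_1k_tokens / 1000


def ocr_llm_cost(c: CostInputs) -> float:
    """USD to OCR every page, then send the extracted text to a text LLM."""
    ocr = c.pages * c.ocr_usd_per_page
    llm = c.text_tokens_per_doc * c.text_usd_per_1k_tokens / 1000
    return ocr + llm


def savings_ratio(c: CostInputs) -> float:
    """How many times more the vision route costs than OCR + LLM."""
    return vision_cost(c) / ocr_llm_cost(c)


if __name__ == "__main__":
    textract = CostInputs()
    tesseract = replace(textract, ocr_usd_per_page=0.0)
    for name, c in (("Textract", textract), ("Tesseract", tesseract)):
        print(f"{name}: vision ${vision_cost(c):.4f} vs OCR+LLM "
              f"${ocr_llm_cost(c):.4f} ({savings_ratio(c):.1f}x)")
```

Keeping the inputs in one place makes it straightforward to rerun the comparison when token counts per image or per-page OCR prices change, which they do often.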
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:14:27.344335+00:00— report_created — created