Report #71897
[cost_intel] Why does vision API usage cost 10x expected on document processing?
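Avoid the 'auto' and 'high' image detail settings in GPT-4V when processing standard printed documents (DPI <200). High detail (which 'auto' selects for page-sized images) splits each image into 512px tiles billed at 170 tokens per tile plus an 85-token base, so a page consumes roughly 750-1100+ tokens (cost: $0.005-0.01/page). Claude 3 Sonnet has no equivalent detail switch; its image tokens scale with pixel area, so downscale pages before upload instead. Use 'low' detail (single 512px image, flat 85 tokens, $0.0004/page) or pre-process with OCR (PyMuPDF, Tesseract) and send the extracted text to Haiku. By these per-page figures, this cuts vision costs by roughly 12-25x, with <2% accuracy loss on printed text. Reserve high-res for diagrams, handwriting, or photos.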
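The token arithmetic above can be sketched as follows. This is a minimal estimate, assuming the tile rules OpenAI documents for GPT-4-class vision models (fit within 2048x2048, shrink the short side to 768px, then count 512px tiles); the function names estimate_image_tokens and image_part are hypothetical helpers, and actual billing may differ by model version.

```python
import math

BASE_TOKENS = 85        # flat charge per image; the entire cost in 'low' detail
TOKENS_PER_TILE = 170   # charge per 512px tile in 'high' detail


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (assumed OpenAI-style tiling rules)."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: shrink to fit inside 2048 x 2048, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink again so the shorter side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512px tiles needed to cover the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def image_part(b64_png: str, detail: str = "low") -> dict:
    """Build a Chat Completions image content part with an explicit detail level."""
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{b64_png}",
            "detail": detail,
        },
    }


# A US Letter page rendered at 200 DPI is 1700 x 2200 px.
high = estimate_image_tokens(1700, 2200, "high")
low = estimate_image_tokens(1700, 2200, "low")
print(high, low, round(high / low, 1))
```

Setting detail explicitly in every request avoids depending on what 'auto' picks for a given page size.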
Journey Context:
Developers often assume 'more pixels = better OCR', but multimodal LLMs are not OCR engines: they bill images by tile grid, not by how much text is on the page. At the per-page prices above, a 10-page PDF costs about $0.10 at auto/high resolution versus $0.004 at low resolution. The quality difference is negligible for printed text but large for infographics. The real optimization is to avoid vision entirely for text-heavy PDFs: extract the text with traditional OCR (free or cheap), then pass it to a cheap text model. Use vision APIs only when the layout or meaning is visual (tables, charts, handwritten notes).
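One way to apply this routing is sketched below. It is an illustration only: choose_route, route_pdf, and the 200-character threshold are hypothetical, and using embedded images or vector drawings as a proxy for "visual content" is an assumption that will misclassify some pages (for example, a text page with a logo). The PyMuPDF import is deferred so the decision logic runs without the library installed.

```python
def choose_route(extracted_text: str, has_visual_content: bool,
                 min_chars: int = 200) -> str:
    """Pick 'text' (cheap text model) or 'vision' for a single page.

    min_chars is a hypothetical threshold: pages with less extracted text
    are treated as scans or image-only pages.
    """
    enough_text = len(extracted_text.strip()) >= min_chars
    if enough_text and not has_visual_content:
        return "text"
    return "vision"


def route_pdf(path: str) -> list[str]:
    """Classify every page of a PDF using its existing text layer."""
    import fitz  # PyMuPDF; imported lazily so choose_route has no dependency

    routes = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            # Assumption: embedded images or vector drawings signal charts,
            # tables, or scans that a text-only pass would lose.
            has_visual = bool(page.get_images()) or bool(page.get_drawings())
            routes.append(choose_route(text, has_visual))
    return routes
```

Pages routed to 'vision' can still be sent at low detail first, with high detail kept as a fallback for diagrams or handwriting.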
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:15:47.229609+00:00— report_created — created