Report #94120
[cost\_intel] Vision token bloat when analyzing text-heavy UI screenshots
Pre-extract text via OCR \(Tesseract/AWS Textract\) for text-heavy images \(screenshots, documents\) before LLM analysis; reserve vision models for spatial/layout understanding. Estimated cost reduction: 10-20x \($0.005 vs $0.05-0.10 per image\) with minimal quality loss on text content.
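A minimal sketch of the routing decision, assuming the OCR step has already produced extracted text and a mean confidence score. The names `OcrResult`, `choose_route`, `SPATIAL_HINTS`, and the threshold of 80 are hypothetical, chosen for illustration; the keyword check is deliberately crude and would need tuning for a real workload.

```python
from dataclasses import dataclass

# Hypothetical phrases that suggest the task depends on layout or appearance.
SPATIAL_HINTS = ("click", "left of", "right of", "above", "below", "next to", "color")


@dataclass
class OcrResult:
    text: str               # text extracted by the OCR engine
    mean_confidence: float  # 0-100, e.g. averaged over per-word confidences


def choose_route(task: str, ocr: OcrResult, min_confidence: float = 80.0) -> str:
    """Return 'text' to send only OCR text, or 'vision' to send the image."""
    task_lower = task.lower()
    # Layout-dependent tasks need the image: OCR text carries no positions.
    if any(hint in task_lower for hint in SPATIAL_HINTS):
        return "vision"
    # Empty or low-confidence OCR output is not trustworthy enough on its own.
    if not ocr.text.strip() or ocr.mean_confidence < min_confidence:
        return "vision"
    return "text"


if __name__ == "__main__":
    ocr = OcrResult(text="Error 500: timeout", mean_confidence=93.0)
    print(choose_route("Summarize the error message", ocr))
    print(choose_route("Click the red button", ocr))
```

Keeping the decision separate from the OCR and model calls makes it easy to log how often each route is taken, which is the number that determines the actual saving.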
Journey Context:
Engineers often send screenshots directly to GPT-4o vision for UI automation or document analysis, paying 85-170 tokens per low-res image \(512x512\) at $5.00 per 1M input tokens: roughly $0.0004-0.0008 per image for the vision tokens alone, before generation. For text-heavy screenshots, roughly 90% of the information is extractable via OCR \(Tesseract is free; AWS Textract costs ~$0.001 per page\). At the same input rate, OCR text costs about $0.001 per 200 tokens \(200 × $5.00 / 1M\), so the input-side saving depends on how many vision tokens the image would otherwise consume: it is negligible against a single low-res image and grows as the image is sent at higher detail and consumes more vision tokens. The degradation signature: OCR loses spatial relationships \('button to the right of text field'\) and visual attributes such as color, so pure OCR fails on layout-dependent tasks \('click the red button'\). Hybrid approach: use OCR for text content, and call the vision model only for spatial verification or when OCR confidence is low. Cost math as reported: GPT-4o vision low-res = $0.0025 per image \+ generation; OCR \+ GPT-4o text = $0.001 \+ $0.0005 = $0.0015, about 1.7x cheaper with Textract and about 5x cheaper \($0.0005\) with free OCR.
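The per-image arithmetic can be reproduced with a short sketch. It assumes the $5.00 per 1M input-token rate and the token counts and dollar figures quoted in this report; the function names `token_cost` and `savings_ratio` are hypothetical helpers, and current prices should be substituted before relying on the output.

```python
INPUT_PRICE_PER_MILLION = 5.00  # USD per 1M input tokens, as quoted in the report


def token_cost(tokens: int, price_per_million: float = INPUT_PRICE_PER_MILLION) -> float:
    """Input cost in USD for a given number of tokens."""
    return tokens * price_per_million / 1_000_000


def savings_ratio(vision_cost: float, ocr_fee: float, text_cost: float) -> float:
    """How many times cheaper the OCR + text path is than the vision path."""
    return vision_cost / (ocr_fee + text_cost)


if __name__ == "__main__":
    # Vision tokens for one low-res image (85-170 tokens per the report).
    print(f"vision, 85 tokens:  ${token_cost(85):.6f}")
    print(f"vision, 170 tokens: ${token_cost(170):.6f}")
    # 200 tokens of OCR text at the same input rate.
    print(f"text, 200 tokens:   ${token_cost(200):.6f}")
    # The report's cost-math figures: $0.0025 vision vs OCR fee + $0.0005 text.
    print(f"paid OCR ratio: {savings_ratio(0.0025, 0.001, 0.0005):.2f}x")
    print(f"free OCR ratio: {savings_ratio(0.0025, 0.0, 0.0005):.2f}x")
```

Running the same functions with measured token counts from real screenshots gives a more reliable estimate than any fixed multiplier, since the ratio is sensitive to image detail level and to how much text the OCR step returns.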
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:34:05.243258+00:00 - report_created - created