Report #30006
[cost_intel] Sending screenshots or images of text to vision-capable LLMs when OCR + text model would suffice
Use local OCR (e.g., Tesseract) or an inexpensive dedicated OCR API to extract the text first, then send only the extracted text to the LLM.
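A minimal sketch of the OCR-first routing described above, assuming Python with the `pytesseract` and `Pillow` packages and a local Tesseract install. The names `tesseract_ocr`, `build_request`, and `min_chars`, and the fallback to the vision path when OCR finds almost no text, are illustrative assumptions and not part of the original report.

```python
from typing import Callable


def tesseract_ocr(image_path: str) -> str:
    """Extract text locally with Tesseract.

    Assumes the tesseract binary plus the pytesseract and Pillow packages
    are installed; imported lazily so the routing logic works without them.
    """
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def build_request(
    image_path: str,
    question: str,
    ocr: Callable[[str], str] = tesseract_ocr,
    min_chars: int = 20,  # hypothetical threshold; tune for your inputs
) -> dict:
    """Prefer a text-only request; fall back to vision if OCR finds little text."""
    text = ocr(image_path).strip()
    if len(text) < min_chars:
        # Probably a photo or diagram: the image itself carries the information.
        return {"mode": "vision", "image_path": image_path, "question": question}
    prompt = f"{question}\n\n--- extracted text ---\n{text}"
    return {"mode": "text", "prompt": prompt}
```

The returned dictionary is only a routing decision; how it is turned into an actual API call depends on the LLM client in use. Spot-check OCR quality on a few representative screenshots before relying on the text path, since small fonts or unusual color schemes can degrade extraction.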
Journey Context:
Vision tokens are expensive (an image is often billed at a multiple of the tokens the same content would cost as text). If the image is just a screenshot of a terminal or a document, the model's visual reasoning is overkill and costly. Extracting the text first can shrink, for example, a 1000-token image to a 200-token text input, an 80% reduction in input tokens for that item. It also often improves accuracy on text-only tasks, since the model works from the text directly instead of reading it out of pixels.
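The arithmetic behind the 80% figure can be checked with a small helper. This is a sketch that assumes image and text tokens are billed at the same per-token rate. The function name `input_token_savings` is hypothetical, and the token counts are the illustrative ones from the paragraph above, not measurements.

```python
def input_token_savings(image_tokens: int, text_tokens: int) -> float:
    """Fraction of input tokens saved by sending OCR text instead of the image."""
    if image_tokens <= 0:
        raise ValueError("image_tokens must be positive")
    return (image_tokens - text_tokens) / image_tokens


# Illustrative numbers from the report: 1000-token image vs 200-token text.
saving = input_token_savings(1000, 200)
print(f"{saving:.0%} fewer input tokens")
```

Output tokens are unaffected by this change, so the saving on the total bill is smaller than the saving on input tokens whenever responses are long.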
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:45:11.122203+00:00— report_created — created