Report #31662
[cost\_intel] Should I use a vision model to read code from screenshots or PDFs?
Use OCR \+ a text model for code extraction whenever possible. Reserve vision models for cases where layout or visual relationships \(like UML diagrams\) are the primary input. Image input can cost roughly 3-5x more than the same code supplied as text, and vision models hallucinate variable names more often than text models do.
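The rule above can be sketched as a small routing function. This is only an illustration: `ExtractionJob`, its fields, and the pipeline labels are hypothetical names, and the `has_text_layer` shortcut (skipping OCR when a PDF already carries selectable text) is an added assumption rather than something the report states.

```python
from dataclasses import dataclass


@dataclass
class ExtractionJob:
    """Hypothetical description of one screenshot or PDF page."""
    layout_is_primary: bool  # e.g. a UML diagram, where positions carry meaning
    has_text_layer: bool     # assumption: PDF with selectable text, no OCR needed


def choose_pipeline(job: ExtractionJob) -> str:
    """Pick the cheapest route that still preserves the information needed."""
    if job.layout_is_primary:
        # Visual relationships are the input, so plain text would lose them.
        return "vision"
    if job.has_text_layer:
        # Text is already machine-readable; send it straight to a text model.
        return "text"
    # Default for screenshots and scanned PDFs of code.
    return "ocr+text"
```

For a plain screenshot of source code, `choose_pipeline(ExtractionJob(layout_is_primary=False, has_text_layer=False))` selects the OCR route; only the diagram case falls through to the vision model.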
Journey Context:
It's tempting to throw a screenshot at GPT-4o and ask for the code, but image input is expensive. For a 100-line code snippet, the image might cost 1000\+ vision tokens, whereas the same code as text might be around 200 tokens. More importantly, vision models struggle with exact character reproduction \(e.g., distinguishing l, 1, and I\), and a single wrong character silently changes an identifier. OCR extracts the text cheaply, and a cheap text model can then fix the OCR artifacts for a fraction of the cost.
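A minimal sketch of both points, using the token figures quoted above. The price passed to `input_cost` is a placeholder, not a real rate for any model, and `flag_confusable_identifiers` is a hypothetical helper that marks identifiers mixing two or more of the look-alike characters l, 1, and I so they can be checked by hand or by the cleanup model.

```python
import re

# Look-alike characters named in the report.
CONFUSABLE = set("l1I")

IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def input_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of sending `tokens` input tokens at the given rate."""
    return tokens * usd_per_million_tokens / 1_000_000


def flag_confusable_identifiers(code: str) -> list[str]:
    """Return identifiers containing at least two distinct look-alike characters.

    Order of first appearance is preserved and duplicates are dropped.
    """
    flagged = []
    for name in IDENTIFIER.findall(code):
        if len(CONFUSABLE & set(name)) >= 2 and name not in flagged:
            flagged.append(name)
    return flagged


# Token counts from the report; the rate is a placeholder assumption.
PLACEHOLDER_RATE = 2.50  # USD per million input tokens, illustrative only
image_cost = input_cost(1000, PLACEHOLDER_RATE)
text_cost = input_cost(200, PLACEHOLDER_RATE)
token_ratio = 1000 / 200  # image tokens per text token for the same snippet

ocr_output = "val1 = total + Il\nidx1 = 0"
suspects = flag_confusable_identifiers(ocr_output)
```

At an identical per-token rate the cost gap comes entirely from the token count, so the ratio holds whatever rate is plugged in. The flagging heuristic is deliberately narrow: names such as `total` or `idx1` contain only one look-alike character and are left alone.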
Lifecycle
2026-06-18T07:31:58.452588+00:00— report_created — created