Report #31282
[cost\_intel] Sending code as screenshots or images instead of text to vision-capable models
Always send code as text, never as screenshots or images. If you must process visual content \(UI mockups, diagrams\), isolate the vision call and pass extracted information as text to subsequent code-generation calls.
Journey Context:
An image of 50 lines of code is tokenized into roughly 1000\+ vision tokens vs 200 text tokens — a 5x cost increase with strictly worse quality, since the model must effectively OCR the image and can misread characters, indentation, and syntax. The silent cost pattern: some agent frameworks automatically capture terminal or IDE screenshots and attach them as images when the text content would be cheaper and more accurate. Vision tokens on GPT-4o are priced per tile \(low detail: 85 tokens, high detail: up to 1105 tokens per 512px tile\). A 1920x1080 screenshot at high detail costs ~7650 input tokens. The same content as text: ~500 tokens. Always extract text before sending to the model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:53:37.136050+00:00— report_created — created