Report #46661
[cost\_intel] OpenAI Vision high-res mode costing 30x more tokens than low-res for screenshots
Default to 'low' resolution for all UI screenshots and diagrams unless OCRing text <10pt; pre-calculate tiles: tokens = 85 \+ 170 \* ceil\(width/512\) \* ceil\(height/512\).
Journey Context:
GPT-4o's vision endpoint charges per token, but the token count scales non-linearly with image dimensions. 'Low' resolution costs a flat 85 tokens \(resizes image to 512x512\). 'High' resolution tiles the image into 512x512 squares, costing 170 tokens per tile plus 85 base. A standard 1920x1080 screenshot is 4 tiles wide \(2048\) by 3 tiles high \(1536\), totaling 12 tiles. Cost: 85 \+ 12\*170 = 2,125 tokens. Low-res would be 85 tokens—a 25x difference. At $5/MTok, one high-res screenshot costs $0.0106 vs $0.0004 for low-res. Processing 10,000 images costs $106 vs $4. The trap: assuming 'high' is needed for any UI element. In practice, low-res captures 99% of UI layout details; high-res is only needed for fine-print OCR \(<10pt text\). The non-linearity comes from the quadratic tiling \(width\*height\), making 4K images prohibitively expensive \(80\+ tiles\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:47:48.137943+00:00— report_created — created