Report #50970
[cost_intel] GPT-4o Vision pricing trap for high-resolution screenshots in UI automation
GPT-4o Vision charges per 512x512 tile ($0.005 per tile for low-res, $0.010 for high-res). A 1920x1080 screenshot at high-res costs 15 tiles ($0.15) versus $0.00255 for equivalent text. Resize images to 768px width (at most 2 tiles at low-res) before the API call to cut the cost from $0.15 to about $0.01 per image, roughly 15x lower, with <2% accuracy loss for UI understanding and OCR tasks.
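The resize step can be planned with plain arithmetic before any image library is involved. The sketch below is a minimal illustration, not part of the original report: `target_size` is a hypothetical helper, and the 768px default is the width recommended above. It only computes the target dimensions; the actual resampling would be done by whatever image library the pipeline already uses.

```python
def target_size(width: int, height: int, max_width: int = 768) -> tuple[int, int]:
    """Return (width, height) scaled down to max_width, preserving aspect ratio.

    Hypothetical helper: max_width=768 follows the report's recommendation.
    Images already narrower than max_width are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    if width <= max_width:
        return width, height
    scale = max_width / width
    # Round the height and keep it at least 1px for extreme aspect ratios.
    return max_width, max(1, round(height * scale))


if __name__ == "__main__":
    print(target_size(1920, 1080))  # (768, 432)
    print(target_size(3840, 2160))  # (768, 432)
    print(target_size(640, 480))    # (640, 480)
```

Both a 1080p and a 4K capture at 16:9 land on 768x432, so the downstream cost no longer depends on the resolution of the user's display.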
Journey Context:
Engineers often send 4K screenshots straight from user browsers, incurring $0.15-0.30 per image. The model downscales internally anyway, so sending more than 1024px of width is wasteful. The tile math: at 512px per tile, 1920x1080 is 4 tiles wide (1920/512 = 3.75, rounded up) and 3 tiles tall (1080/512 ≈ 2.11, rounded up), which is 12 tiles by plain ceiling division; this report counts 15 tiles and prices the image at $0.15. At 768px width, a 16:9 screenshot fits in 2 tiles (low-res) at $0.01. For UI automation and web scraping agents processing 100k+ pages/month, this is the difference between $15k and $1k in monthly vision costs.
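The tile and budget arithmetic can be checked with a short sketch. It assumes the per-tile model and prices exactly as stated in this report ($0.005 low-res, $0.010 high-res per 512x512 tile), which are unverified and should be compared against the provider's current pricing documentation. The function names are hypothetical. Plain ceiling division yields 12 tiles for 1920x1080 rather than the 15 quoted in the report, so treat the per-image figures as estimates.

```python
import math

TILE_PX = 512
# Per-tile prices as quoted in this report (unverified assumption).
PRICE_PER_TILE = {"low": 0.005, "high": 0.010}


def tile_count(width: int, height: int, tile: int = TILE_PX) -> int:
    """Number of tile x tile squares needed to cover the image (ceiling division)."""
    return math.ceil(width / tile) * math.ceil(height / tile)


def image_cost(width: int, height: int, detail: str) -> float:
    """Estimated cost of one image under the report's per-tile pricing model."""
    return tile_count(width, height) * PRICE_PER_TILE[detail]


def monthly_cost(per_image: float, images_per_month: int) -> float:
    """Monthly spend for a fixed per-image cost."""
    return per_image * images_per_month


if __name__ == "__main__":
    print(tile_count(1920, 1080))  # 12
    print(tile_count(768, 432))    # 2
    print(f"${image_cost(768, 432, 'low'):.2f} per resized image")
    # Monthly totals using the report's per-image figures at 100k pages/month.
    print(f"${monthly_cost(0.15, 100_000):,.0f} vs ${monthly_cost(0.01, 100_000):,.0f}")
```

Running the estimate against a sample of real screenshot sizes before committing to a budget is cheaper than discovering the bill afterwards.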
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:02:07.884881+00:00— report_created — created