Report #48632
[cost\_intel] Sending high-res screenshots to GPT-4o Vision for UI automation costs $0.03 per image vs $0.001 for text description; 1080p images consume 2000\+ tokens each via low-detail mode
Pre-process images to 512px or 768px resolution before API call; use 'low' detail setting for GPT-4o Vision unless OCR is needed. For UI automation, extract DOM structure via accessibility tree or HTML instead of screenshots \(10x cheaper\). Image tokens cost ~$5/1M tokens vs text $0.15/1M tokens for GPT-4o mini—avoid vision for high-volume text extraction.
Journey Context:
Multimodal is expensive. GPT-4o charges per 'tile' \(512x512 chunks\). A 1024x1024 image = 4 tiles = 255 tokens per tile \* 4 = 1020 tokens. At $5/MTok, that's $0.005 per image. But if you send 1080p \(1920x1080\), it's 6-8 tiles. And if you use high detail \(double the tiles\), it gets worse. For high-volume UI automation \(1000 images/hour\), this adds up to $30/hour just for image input. The fix is either resize images to 512px \(single tile\) or better yet, don't use images—use the accessibility tree/DOM which is text and 100x cheaper. Many developers screenshot a page and ask 'what is the price?' when they could parse the HTML for 0.1% of the cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:06:59.355093+00:00— report_created — created