Report #65390
[cost\_intel] Processing 4K screenshots in GPT-4o Vision costs up to 10x more than necessary for UI layout tasks
Resize screenshots down to a low resolution of 512px or 768px before sending them to the vision API for layout detection, element location, or color analysis. Reserve high resolution for OCR on small text or detailed image analysis. This reduces cost from $0.005-$0.015 per image to $0.0005-$0.0015.
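A minimal sketch of the downscaling step, assuming the 512px/768px target applies to the longest side of the screenshot (the report does not say which side it means). The helper name `downscaled_size` is hypothetical; it only computes the target dimensions and never upscales.

```python
def downscaled_size(width: int, height: int, max_side: int = 768) -> tuple[int, int]:
    """Return (width, height) scaled so the longest side is at most max_side.

    Aspect ratio is preserved and images already small enough are left alone.
    Assumption: max_side refers to the longest side of the screenshot.
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    new_w = max(1, round(width * max_side / longest))
    new_h = max(1, round(height * max_side / longest))
    return new_w, new_h


if __name__ == "__main__":
    # A 3840x2160 (4K) screenshot reduced to the two sizes named in the report.
    print(downscaled_size(3840, 2160, 768))
    print(downscaled_size(3840, 2160, 512))
```

The resulting size can then be passed to whatever image library is in use, for example Pillow's `Image.resize`, before the image is encoded and sent.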
Journey Context:
GPT-4o Vision bills images in tokens, and the token count grows with resolution \(in tiles, not strictly per pixel\): a low-res 512px image costs roughly the base rate, while a high-res 4K screenshot costs about 4x the tokens. A 4K screenshot consumes ~1000-2000 tokens \($0.005-$0.01\) vs a 512px version at ~250 tokens \($0.00125\). For UI automation tasks \(finding buttons, reading layout\), downscaling preserves semantic layout information while eliminating noise from high-res textures. The cliff is small-text OCR: downsampling blurs sub-10pt fonts, so those cases need high-res input. Degradation shows up in two ways: 'coordinate drift', where pixelation causes the model to mislocate elements by 10-20px, and 'text unreadability' on dense UIs.
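The token figures above can be roughly reproduced with the sketch below. It assumes the tile-based accounting OpenAI has documented for GPT-4o image inputs: a flat 85 tokens in low-detail mode, and in high-detail mode 85 base tokens plus 170 per 512px tile after the image is fitted within 2048x2048 and its shortest side is reduced to 768px. The function names and the $5 per million input tokens default are illustrative assumptions (the price is the one implied by the report's own figures), so check current pricing before relying on the dollar amounts.

```python
import math

BASE_TOKENS = 85    # assumed flat cost per image (and total cost in low-detail mode)
TILE_TOKENS = 170   # assumed cost per 512px tile in high-detail mode


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image under the assumed tile formula."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: fit within a 2048x2048 box (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512px tiles needed to cover the image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


def estimate_cost_usd(tokens: int, usd_per_million: float = 5.0) -> float:
    """Convert tokens to dollars; the default rate is an assumption."""
    return tokens * usd_per_million / 1_000_000


if __name__ == "__main__":
    for size in [(3840, 2160), (768, 432), (512, 288)]:
        tokens = estimate_image_tokens(*size)
        print(size, tokens, f"${estimate_cost_usd(tokens):.5f}")
```

Under these assumptions a 512px-wide screenshot fits in a single tile, which lines up with the ~250 token figure quoted above, while a 768px-wide one spills into a second tile and costs more.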
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:14:17.504399+00:00— report_created — created