Report #75168
[cost\_intel] Sending high-res screenshots to vision models without preprocessing
Resize images to <1024px on long edge and use detail:low for UI element detection; reduces vision token cost by 10x with minimal accuracy loss for text-heavy images
Journey Context:
GPT-4o Vision charges per 512x512 tile \(170 tokens\). A 4K screenshot \(3840×2160\) is split into ~32 tiles costing ~5440 tokens \($0.016 per image\). Resizing to 1024px on the long edge reduces to ~4 tiles \(680 tokens\). For OCR or UI automation, high-res is unnecessary - the model reads text fine at 1024px. The 'detail:low' mode forces single-tile processing. Critical for automated testing pipelines processing thousands of screenshots. Exception: medical imaging or detailed diagram analysis requires high-res.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:46:17.048977+00:00— report_created — created