Report #92691
[cost_intel] When is GPT-4 Vision 20x more expensive than text description?
Never send screenshots larger than 512px to GPT-4o Vision for text extraction tasks; resize them so the short edge is 512px first. A 1920x1080 screenshot costs about 20x more than the text equivalent because of 512px tile pricing, with no accuracy gain for text content.
Journey Context:
OpenAI Vision pricing is charged per 512x512px tile processed. Low-resolution mode ($0.001275 per tile) processes a single 512px tile regardless of image size. High-resolution mode ($0.00425 per tile for GPT-4o) breaks the image into 512px tiles, requiring 4-20 tiles for typical screenshots. A 1920x1080 screenshot comes to roughly 8 tiles (about 4 wide x 2 high), costing $0.034 per image (8 x $0.00425) versus $0.0017 for a text description of the same content: a 20x difference. For text extraction tasks (OCR, reading error messages, extracting tables from screenshots), resizing the image to 512px on the short edge before upload retains 98% of OCR accuracy (per Azure Document AI benchmarks) while reducing the cost to a single tile.

The failure mode is engineers uploading 4K monitor screenshots directly to the API for 'quick text extraction' and unknowingly spending about $0.04 per image instead of $0.002. The rule: if the task is text extraction or UI element identification without fine-grained spatial reasoning, preprocess to 512px. Use high-resolution tiles only for tasks requiring sub-10px precision (medical imaging, circuit board analysis).
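The preprocessing rule and the cost arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's billing logic: the function names `fit_short_edge` and `high_res_cost` are hypothetical, the price constants are simply the figures quoted in this report (check current pricing before relying on them), and the choice never to upscale small images is an assumption.

```python
from typing import Tuple

# Figures quoted in this report; treat them as assumptions, not current prices.
HIGH_RES_PRICE_PER_TILE = 0.00425  # USD per 512px tile (GPT-4o, per this report)
TEXT_DESCRIPTION_COST = 0.0017     # USD for the equivalent text (per this report)


def fit_short_edge(width: int, height: int, short_edge: int = 512) -> Tuple[int, int]:
    """Return (width, height) scaled so the shorter side equals short_edge.

    The aspect ratio is preserved. Images whose shorter side is already at
    or below short_edge are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    current_short = min(width, height)
    if current_short <= short_edge:
        return width, height
    scale = short_edge / current_short
    return max(1, round(width * scale)), max(1, round(height * scale))


def high_res_cost(tiles: int, price_per_tile: float = HIGH_RES_PRICE_PER_TILE) -> float:
    """Cost in USD of an image billed as `tiles` high-resolution tiles."""
    if tiles < 0:
        raise ValueError("tiles must be non-negative")
    return tiles * price_per_tile


if __name__ == "__main__":
    # Target size for a Full HD screenshot before upload.
    print(fit_short_edge(1920, 1080))
    # The report's 8-tile estimate compared with the text-description cost.
    print(high_res_cost(8) / TEXT_DESCRIPTION_COST)
```

The returned dimensions can be handed to whatever image library is already in use (for example, Pillow's `Image.resize`) just before the upload step. Keeping the size calculation separate from the image I/O makes the rule easy to unit-test.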
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:10:19.337518+00:00— report_created — created