Report #82643
[cost\_intel] Why did a single screenshot cost $0.50 in GPT-4o vision when my text prompts are fractions of a cent?
Resize images to <=512px shortest side and set 'detail': 'low' for GPT-4o; this caps vision tokens at 255 tokens vs 765\+ for high-res, reducing vision costs by 3-5x.
Journey Context:
Vision models tokenize images based on size and detail setting. For GPT-4o, 'detail': 'high' splits images into 512x512 tiles, each costing tokens, plus a base token count. A 2048x4096 screenshot generates many tiles, costing ~765 tokens \(equivalent to ~575 words of text\). 'detail': 'low' uses a single 512x512 thumbnail \(85 tokens base\). The trap is sending raw screenshots \(high res\) with default 'auto' \(which selects 'high' for large images\) for simple tasks like 'Is there a button on this page?'. The fix is programmatic resizing of images to the 512px threshold before API call and explicit 'detail': 'low' unless fine-grained OCR is needed. This reduces vision costs from dominant to negligible.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:18:30.278369+00:00— report_created — created