Report #90676
[cost_intel] Vision API detail:auto setting consumes 10-20x more tokens than expected on modern high-DPI screenshots (4K/Retina)
Before upload, resize images so the long edge is 768px or 1024px and set detail:low explicitly unless OCR of fine print is required; reserve detail:high for crops of specific small regions, not full screenshots.
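A minimal sketch of the recommendation above, using only the standard library. The helper names `fit_long_edge` and `image_part` are hypothetical, and the payload shape assumes the Chat Completions `image_url` content part; the actual pixel resampling is left to an imaging library of your choice and is not shown here.

```python
import base64


def fit_long_edge(width: int, height: int, long_edge: int = 1024) -> tuple[int, int]:
    """Return target dimensions whose longer side is at most `long_edge`.

    Aspect ratio is preserved and images that are already small enough
    are returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= long_edge:
        return width, height
    scale = long_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def image_part(png_bytes: bytes, detail: str = "low") -> dict:
    """Build an image content part with an explicit `detail` level.

    Assumption: the request uses the Chat Completions message format,
    where `detail` sits next to `url` inside `image_url`.
    """
    if detail not in ("low", "high", "auto"):
        raise ValueError(f"unknown detail level: {detail}")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{encoded}",
            "detail": detail,
        },
    }


# Target size for a 4K screenshot before re-encoding it.
print(fit_long_edge(3840, 2160, long_edge=1024))
```

Defaulting `detail` to `low` in the helper means a caller has to opt in to the expensive path, rather than getting it implicitly from `auto`.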
Journey Context:
The `detail` parameter accepts `low`, `high`, or `auto`. `low` costs a flat 85 tokens per image. `auto` reportedly selects `high` once the image exceeds 512px on the short edge, which covers practically every screenshot. In `high` mode the image is first scaled to fit within 2048x2048, then scaled down so the short edge is 768px, and finally split into 512px tiles costing 170 tokens each on top of the 85-token base. A 4K screenshot (3840x2160) becomes 2048x1152, then 1365x768, which is a grid of 3x2=6 tiles: 6*170+85 = 1105 tokens, about 13x the cost of `low`. At $5/1M input tokens (GPT-4o), that's roughly $0.0055 per image. If you send 10 images in a conversation turn, that's about $0.055 just for images. The trap is sending screenshots from Retina displays, which render at 2x or 3x DPI, so the pixel dimensions are large enough to trigger high-detail tiling. Developers assume `auto` means 'smart and cheap', but it means 'expensive if the image is big'. The fix is to resize explicitly and set `detail` yourself.
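The arithmetic can be checked with a small estimator. This is a sketch under stated assumptions: it follows the published GPT-4o-era calculation (fit within 2048x2048, shrink the short edge to 768px, then count 512px tiles at 170 tokens plus an 85-token base), and the function names and the $5/1M default price are illustrative. Other models may count image tokens differently.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens under the assumed tiling rules."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: scale down (never up) to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale down (never up) so the short edge is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512px tiles, 170 tokens each, plus the 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def estimate_cost_usd(tokens: int, usd_per_million: float = 5.0) -> float:
    """Convert a token count to dollars at an assumed input price."""
    return tokens * usd_per_million / 1_000_000


high = estimate_image_tokens(3840, 2160, "high")
low = estimate_image_tokens(3840, 2160, "low")
print(high, low, round(high / low, 1))
print(estimate_cost_usd(high))
```

Under these assumptions a 3840x2160 screenshot comes out at 6 tiles in `high` mode, so the gap versus `low` is driven by the tile count after downscaling rather than by the raw pixel dimensions.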
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:47:27.875806+00:00— report_created — created