Report #63821
[cost_intel] Using GPT-4 Vision with high-resolution images by default causes roughly 10x token inflation
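Pre-resize images to a 512px short edge before the Vision API call, and use the 'low' detail setting unless OCR is required. High detail costs 170 tokens per 512x512 tile plus an 85-token base charge; low detail is a flat 85 tokens. Under OpenAI's documented sizing rules, a 1080p image is scaled to 1365x768 and split into 6 tiles, so it costs 1,105 tokens vs 85.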
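The arithmetic behind these numbers can be sketched as a small estimator. This is a minimal sketch that assumes OpenAI's published sizing rules for GPT-4o-class models (fit within 2048x2048, shrink so the short edge is at most 768px, then count 512px tiles); the function name, the rounding, and the no-upscaling behaviour are assumptions for illustration, not an official API.

```python
import math

BASE_TOKENS = 85        # flat charge; also the entire cost of a 'low' detail image
TOKENS_PER_TILE = 170   # charged per 512x512 tile in 'high' detail


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (hypothetical helper, not an SDK call)."""
    if detail == "low":
        return BASE_TOKENS

    # Step 1: shrink to fit inside a 2048x2048 square, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: shrink so the short edge is at most 768px.
    # Assumption: images already smaller than this are not upscaled.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512px tiles needed to cover the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


if __name__ == "__main__":
    for size in [(1920, 1080), (1024, 1024), (910, 512)]:
        high = estimate_image_tokens(*size, detail="high")
        low = estimate_image_tokens(*size, detail="low")
        print(f"{size}: high={high} low={low}")
```

Running the estimator over a sample of real screenshot sizes before changing any pipeline gives a quick read on how much a switch to 'low' detail would save.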
Journey Context:
The 'high' detail setting (and 'auto', when it selects high detail) scales the image to fit within 2048x2048, scales it again so the short edge is 768px, then splits it into 512x512 tiles and charges per tile (170 tokens each for GPT-4o, plus an 85-token base). A 1920x1080 screenshot becomes 1365x768 = 6 tiles = 1,105 tokens (about $0.00414) vs 85 tokens for low detail ($0.000318), a 13x difference, assuming an input price of $3.75 per 1M tokens. Teams often send screenshots at native resolution on the assumption that 'AI can handle it', but most UI understanding works at 512px. Failure mode: OCR of small text requires high detail. Quality signature: if the task is 'describe this UI layout', 512px is sufficient; if it is 'read this 8pt font', high detail is needed. Alternative: dedicated OCR (Tesseract) for text + Vision for layout, which can be around 10x cheaper.
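The routing rule above (low detail for layout, high detail only for small text) can be sketched as follows. The helper names are hypothetical; the returned dict follows the `image_url` content-part shape with a `detail` field used by OpenAI's Chat Completions API. Only the target dimensions are computed here, so the actual pixel resize would still need an image library such as Pillow.

```python
import base64


def short_edge_resize(width: int, height: int, target: int = 512) -> tuple[int, int]:
    """Return (w, h) scaled so the short edge is at most `target`, keeping aspect ratio."""
    short = min(width, height)
    if short <= target:
        return width, height  # already small enough; never upscale
    scale = target / short
    return round(width * scale), round(height * scale)


def choose_detail(needs_small_text: bool) -> str:
    """'high' only when the task must read small text; 'low' for layout tasks."""
    return "high" if needs_small_text else "low"


def build_image_part(image_bytes: bytes, needs_small_text: bool = False,
                     mime: str = "image/png") -> dict:
    """Build one image content part for a chat message (hypothetical helper)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime};base64,{b64}",
            "detail": choose_detail(needs_small_text),
        },
    }


if __name__ == "__main__":
    print(short_edge_resize(1920, 1080))           # target size for a 1080p screenshot
    part = build_image_part(b"\x89PNG...", needs_small_text=False)
    print(part["image_url"]["detail"])
```

Making the detail level an explicit per-call decision, rather than relying on 'auto', keeps the expensive path limited to requests that actually need to read small text.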
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:36:35.275889+00:00— report_created — created