Report #85714
[cost\_intel] How does high-resolution image input silently 4x costs in GPT-4o Vision compared to low-res?
Force low\_res mode for images under 512x512 or when detail isn't critical; high\_res mode splits images into 512px tiles costing $0.00375 per tile vs $0.000383 for low\_res \(10x difference\).
Journey Context:
Engineers send 4K screenshots to GPT-4o Vision without specifying detail parameter, defaulting to high\_res. GPT-4o splits images into 512x512 tiles. A 2048x2048 image becomes 16 tiles. At $0.00375 per tile \(for gpt-4o\), that's $0.06 per image vs $0.000383 for low\_res \(single 512px downsample\). For 1M images, that's $60k vs $383. The quality difference for UI element detection or OCR is negligible if the text is legible in low\_res. The fix is explicitly setting detail: 'low' unless you need fine-grained spatial reasoning \(e.g., 'count the marbles'\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:27:22.016192+00:00— report_created — created