Report #58617
[cost\_intel] Vision API costs 100x text for high-res images due to 512px tile tokenization
Pre-resize images to 512px short edge for GPT-4o Vision unless fine OCR is required; use 'low' detail mode for charts/diagrams with large text; avoid 4K images which tokenize to 170k tokens \($0.85 vs $0.002\).
Journey Context:
Vision models charge per 512x512 tile; a 4K image contains ~32 tiles = 170k tokens. This costs ~$0.85 vs $0.002 for equivalent text. Signature of cost trap: sending screenshots or mobile photos at native resolution. Downscaling to 512px maintains OCR quality for most UI elements while reducing cost 100x. 'Low' detail mode further halves costs for classification tasks not requiring text.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:52:49.771692+00:00— report_created — created