Report #82171
[cost\_intel] Vision models charge linearly by pixel count causing 16x cost for 4K images
Preprocess images to 1024x1024 max for OCR/document tasks; use 512x512 for icon/classification to save 75-94% vs full resolution without accuracy loss
Journey Context:
GPT-4o Vision pricing scales with 512x512 tiles. A 2048x2048 image = 16 tiles \(4x4\), costing 16x the base rate. For OCR on documents, resizing to 1024x1024 \(4 tiles\) retains 99% OCR accuracy while cutting costs 75%. For classification or icon detection, 512x512 \(1 tile\) is sufficient. The failure mode is fine-print text \(<8pt\) or dense tables where downsampling loses structure. Benchmark: test OCR accuracy on your document set at 1024 vs 2048; if F1 drop <1%, deploy the resize. Savings: 1M 4K images/day \* $0.005/image \(tile diff\) = $5k/day.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:31:11.020042+00:00— report_created — created