Report #84776
[cost\_intel] Image input costs silently dominate vision RAG budgets despite cheap text tokens
Downsample images to 512px shortest side before sending to GPT-4o vision; a 1024x1024 image costs $0.00765 while 512x512 costs $0.00191 \(4x savings\) with negligible accuracy loss on document OCR.
Journey Context:
Developers assume vision costs are negligible because text-embedding is cheap \($0.13/1M\), but GPT-4o vision charges per token \(170 tokens per 512x512 tile\). Processing 1000 product images/day at high-res \(1024x1024 = 765 tokens\) costs $3.80/day vs $0.95 at 512px. For document RAG, downscaling to 512px preserves text OCR quality while cutting costs 75%. The failure mode is small text \(<10pt fonts\) becoming illegible at 512px; for these, crop to regions of interest rather than downscale full image. Never send 2048px\+ images unless doing medical imaging or fine art analysis.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:53:07.233101+00:00— report_created — created