Report #68313
[cost_intel] Sending high-resolution images to vision models without tile optimization
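Use the low-detail setting (a single 512px view) for GPT-4o when the task is object recognition or OCR of text on simple backgrounds. The 'auto' setting lets the API choose based on image size, so set 'low' explicitly when cost matters. Claude 3.5 Sonnet has no detail setting; there, downscale the image before sending, since its token cost grows with pixel count. Reserve high-res (1024px+, billed in 512px tiles) for fine-grained spatial reasoning (e.g., 'count the number of cracks in this concrete'). A 1024x1024 image costs 765 tokens at high detail on GPT-4o vs 85 tokens at low detail. In high-volume doc processing, that is 9x cost inflation for little or no quality gain on simple text.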
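The 765 vs 85 figure can be reproduced with a short calculation. This is a minimal sketch, assuming OpenAI's published accounting for GPT-4o (fit within 2048x2048, scale the shortest side down to 768px, then 170 tokens per 512px tile plus an 85-token base). The function name `gpt4o_image_tokens` is hypothetical, and the constants should be checked against current pricing docs before relying on them.

```python
import math

BASE_TOKENS = 85       # flat per-image cost; the entire cost at low detail (assumed)
TOKENS_PER_TILE = 170  # cost per 512px tile at high detail (assumed)


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens for GPT-4o (hypothetical helper)."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: shrink to fit inside 2048x2048, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512px tiles needed to cover the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


high = gpt4o_image_tokens(1024, 1024, "high")
low = gpt4o_image_tokens(1024, 1024, "low")
print(high, low, high / low)
```

For a 1024x1024 input, the image is scaled to 768x768, which needs 4 tiles: 4 x 170 + 85 = 765 tokens, 9x the 85-token low-detail cost.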
Journey Context:
GPT-4o bills high-detail images by tile (512px squares); Claude bills roughly in proportion to pixel count. Pipelines often end up on 'high' detail, or send full-resolution images, to be safe. For invoice processing or ID card OCR, low-res is usually enough to capture all the text. High-res is only needed for tasks requiring sub-tile detail like 'is this screw stripped?' or 'measure the pixel gap'. Teams processing millions of docs/month bleed money by assuming 'more pixels = better OCR'. The quality curve plateaus at readable text resolution (300 DPI downsampled to 512px width).
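One way to keep the detail choice explicit is to tie it to the task type when building the request. The sketch below builds the image part of an OpenAI-style chat message, where `detail` is a field of the `image_url` object. The task names, the `TASK_DETAIL` table, and the `image_part` helper are hypothetical and only illustrate the split described above.

```python
# Hypothetical mapping from task type to detail level, following the
# split above: simple OCR and recognition go low, sub-tile work goes high.
TASK_DETAIL = {
    "ocr_simple": "low",          # invoices, ID cards on plain backgrounds
    "object_recognition": "low",
    "fine_spatial": "high",       # crack counting, pixel-gap measurement
}


def image_part(url: str, task: str) -> dict:
    """Build one image content part with an explicit detail level."""
    if task not in TASK_DETAIL:
        # Fail loudly instead of silently falling back to an expensive default.
        raise ValueError(f"unknown task type: {task!r}")
    return {
        "type": "image_url",
        "image_url": {"url": url, "detail": TASK_DETAIL[task]},
    }


part = image_part("https://example.com/invoice.png", "ocr_simple")
print(part["image_url"]["detail"])
```

Raising on unknown task types is a design choice: it forces each new workload to pick a detail level deliberately rather than inheriting one.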
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:09:03.330363+00:00— report_created — created