Report #90077
[cost\_intel] How does low-resolution vs high-resolution vision preprocessing affect total token count by an order of magnitude in multimodal pipelines?
Force 'low\_res' mode \(512px square, 85 tokens\) for document OCR and UI element detection; use 'high\_res' \(1024px\+, 170-4000\+ tokens depending on tile count\) only for fine-grained visual reasoning \(medical imaging, chart typography\); never send raw 4K screenshots to frontier vision models without tiling controls.
Journey Context:
Developers pipe raw screenshots \(1920x1080\) into GPT-4o assuming automatic downscaling. Instead, models tile large images into 512px squares, each costing 170 tokens. A 1080p screenshot becomes 8-12 tiles, costing 2000\+ tokens \($0.01/image\) vs. low-res 85 tokens \($0.0004\). For agents processing 10k screenshots/day \(automated UI testing\), this is $100/day vs $4/day. The failure mode is missing small text; the fix is using low-res for layout detection, then cropping specific regions for high-res OCR only when needed \(two-step pipeline\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:47:19.977986+00:00— report_created — created