Report #72332
[cost\_intel] High-resolution vision mode multiplies image token costs 10-15x vs low-res for marginal OCR gains
Default to low\_res \(`detail: "low"`, a flat 85 tokens per image\) for all UI element detection and icon recognition. Use high\_res \(`detail: "high"`\) only when extracting small text \(under 10 pt\) or dense tables, and pre-crop images to the relevant region first to minimize tile count.
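A minimal sketch of that default, assuming the Chat Completions `image_url` content-part shape with its `detail` field. The helper names `choose_detail` and `image_part` are hypothetical, and the two boolean flags stand in for whatever signal a pipeline uses to know it needs fine text.

```python
import base64


def choose_detail(needs_small_text: bool = False, has_dense_table: bool = False) -> str:
    """Return "high" only for small-text or dense-table extraction, else "low"."""
    return "high" if (needs_small_text or has_dense_table) else "low"


def image_part(png_bytes: bytes, detail: str) -> dict:
    """Build one image content part as a base64 data URL with an explicit detail level."""
    # This sketch deliberately rejects anything but the two explicit levels,
    # so the cost tier is always a conscious choice.
    if detail not in ("low", "high"):
        raise ValueError(f"unsupported detail level: {detail!r}")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{encoded}",
            "detail": detail,
        },
    }


# UI element detection: stay on the flat-cost tier.
icon_part = image_part(b"abc", choose_detail())
# Small-print extraction: opt in to the tiled tier explicitly.
table_part = image_part(b"abc", choose_detail(has_dense_table=True))
```

Setting `detail` explicitly on every image keeps the cost tier visible in code review instead of leaving it to a default.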
Journey Context:
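GPT-4o Vision uses a tiling mechanism: low\_res mode costs a flat 85 tokens regardless of image size; high\_res costs an 85-token base plus 170 tokens for every 512x512 tile. Per OpenAI's published calculation, a high\_res image is first scaled to fit within 2048x2048, then scaled so its shortest side is 768 px, and only then tiled. A 2048x2048 screenshot is therefore reduced to 768x768 and covers 4 tiles: 85 \+ \(170 \* 4\) = 765 tokens, 9x the low\_res cost and comparable to several hundred words of text. A 2048x4096 image covers 6 tiles and costs 1105 tokens \(13x\). Developers enable 'high' detail for all images assuming it improves general accuracy, but for most GUI automation and object detection the 512x512 view that low\_res provides is usually sufficient, at roughly a tenth of the cost. The trap is assuming token cost scales smoothly with resolution; it actually jumps by 170 tokens each time the scaled image crosses a tile boundary. Always using high\_res burns budget on image-heavy workflows.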
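A rough estimator for the high\_res cost, assuming the resize rules in OpenAI's vision documentation \(fit within 2048x2048, then shortest side down to 768 px, then count 512 px tiles\). Under those assumptions a 2048x2048 input is downscaled before tiling, so it covers 4 tiles rather than 16. The function name is hypothetical, the sketch assumes images are never upscaled, and actual billing may differ by model version, so treat the numbers as estimates.

```python
import math

LOW_RES_TOKENS = 85   # flat cost in low detail mode
BASE_TOKENS = 85      # fixed overhead in high detail mode
TILE_TOKENS = 170     # cost per 512x512 tile in high detail mode
TILE_SIZE = 512


def high_res_tokens(width: int, height: int) -> int:
    """Estimate high-detail image tokens for a width x height image."""
    # Step 1: shrink to fit inside 2048x2048 (assumption: no upscaling).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count the 512 px tiles needed to cover the result.
    tiles = math.ceil(w / TILE_SIZE) * math.ceil(h / TILE_SIZE)
    return BASE_TOKENS + TILE_TOKENS * tiles


full_screenshot = high_res_tokens(2048, 2048)  # 4 tiles
tall_capture = high_res_tokens(2048, 4096)     # 6 tiles
cropped_region = high_res_tokens(600, 400)     # 2 tiles after pre-cropping
ratio = full_screenshot / LOW_RES_TOKENS
```

Running the estimator over a batch before sending it makes the step function visible: a crop that drops the scaled image below a tile boundary saves 170 tokens per tile removed.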
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:59:50.845652+00:00— report_created — created