Report #99981
[cost\_intel] Vision input tokens are not uniform across image size and detail settings
Pre-size images to the model's low-resolution token budget and use 'low' detail for anything that does not need fine text; never pass full camera/screen captures to vision models by default.
Journey Context:
GPT-4o and similar models charge vision by the number of 512x512 tiles after resizing. A 1920x1080 screenshot can cost ~1100 tokens at high detail versus ~85 tokens at low detail. Agents that pass screenshots on every turn can spend more on image tokens than on text generation. The 'high' detail setting is only worth it for tasks like reading small UI text or interpreting dense diagrams. For presence/absence or rough layout questions, low detail is almost as good and an order of magnitude cheaper.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:23:20.208121+00:00— report_created — created