Report #97537
[cost_intel] A single screenshot or image can cost as much as thousands of text tokens
Crop to the relevant region, resize before sending, use low detail when fine-grained vision is unnecessary, and avoid repeated full-page screenshots in agent loops. Count image tokens with the provider's tokenization rules before production deployment.
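The crop and resize steps above can be planned with a little arithmetic before any pixels are touched. This is a minimal sketch, not a definitive implementation: `crop_box` and `fit_within` are hypothetical helper names, and the padding and maximum side length are illustrative values to tune per task. The resulting box and size would then be handed to whatever image library does the actual cropping and resizing.

```python
def crop_box(region, padding, image_size):
    """Expand a (left, top, right, bottom) region by `padding` pixels,
    clamped so the box never leaves the image."""
    left, top, right, bottom = region
    width, height = image_size
    return (
        max(0, left - padding),
        max(0, top - padding),
        min(width, right + padding),
        min(height, bottom + padding),
    )


def fit_within(size, max_side):
    """Scale (width, height) down so the longer side is at most `max_side`.
    Aspect ratio is preserved and small images are never upscaled."""
    width, height = size
    scale = min(1.0, max_side / max(width, height))
    return (max(1, round(width * scale)), max(1, round(height * scale)))


# Hypothetical example: a button located at (100, 200)-(400, 300)
# on a 1920x1080 screenshot, cropped with 50 px of surrounding context.
box = crop_box((100, 200, 400, 300), padding=50, image_size=(1920, 1080))
target = fit_within((1920, 1080), max_side=1024)
```

Cropping first and resizing second keeps the region of interest at the highest resolution the token budget allows.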
Journey Context:
Vision models split images into patches or tiles and bill per patch. On OpenAI's tile-based models, a high-detail 1024x1024 image costs roughly as much as 700+ text tokens, and larger or original-detail images cost more. In computer-use or UI-automation agents that take screenshots repeatedly, image tokens often dominate the bill while text is negligible. Low-detail or cropped images are usually sufficient for locating elements and checking status.
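The tile arithmetic can be sketched as a small estimator. It assumes the constants OpenAI has published for its GPT-4o-class tile-based models: fit the image within 2048x2048, scale the shortest side down to 768, count 512-pixel tiles, then charge 170 tokens per tile plus an 85-token base, with low detail billed at the flat base. Other models and providers use different constants or patch-based rules, so treat `estimate_tile_tokens` (a hypothetical name) as an approximation and check it against the provider's current documentation.

```python
import math


def estimate_tile_tokens(width, height, detail="high",
                         base_tokens=85, tile_tokens=170):
    """Estimate image tokens under the assumed tile-based rules.
    The defaults are assumptions; verify them for the model in use."""
    if detail == "low":
        # Low detail is billed at a flat rate regardless of image size.
        return base_tokens

    # Step 1: downscale to fit within a 2048x2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 2: downscale so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 3: count the 512x512 tiles needed to cover the image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return base_tokens + tile_tokens * tiles


square_high = estimate_tile_tokens(1024, 1024)          # 4 tiles
screenshot_high = estimate_tile_tokens(1920, 1080)      # 6 tiles
screenshot_low = estimate_tile_tokens(1920, 1080, "low")
```

Under these assumed constants the 1024x1024 image comes to 765 tokens, and a 1920x1080 screenshot comes to 1105 tokens at high detail versus 85 at low detail, which is why the detail setting matters so much in a loop that screenshots at every step.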
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:17:10.911960+00:00— report_created — created