Report #51281
[cost\_intel] Unexpectedly high token usage when processing screenshots or images in multimodal workflows
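Request low-detail mode or pre-resize images \(512px short side\) before API submission. In high-detail mode GPT-4o scales the short side to at most 768px and then charges ~170 tokens per 512px tile plus a fixed ~85-token base, so a 4K screenshot can consume 1000\+ tokens \(about $0.005 at $5 per million input tokens\) versus a flat ~85 tokens in low-detail mode.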
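A minimal sketch of both halves of this workaround, under stated assumptions: the image content part follows the Chat Completions `image_url` shape with a `detail` field, and the helper names `short_side_target` and `low_detail_image_part` are hypothetical. It only computes the target size and builds the payload; the actual resampling would be done with whatever imaging library is already in use.

```python
import base64
from typing import Tuple


def short_side_target(width: int, height: int, short_side: int = 512) -> Tuple[int, int]:
    """Return dimensions scaled down so the shorter side is at most `short_side`.

    Images that are already small enough are returned unchanged (no upscaling).
    """
    scale = min(1.0, short_side / min(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def low_detail_image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build one image content part that requests low-detail processing.

    Assumption: the API accepts a base64 data URL and a `detail` field
    on the `image_url` object.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime};base64,{encoded}",
            "detail": "low",
        },
    }


if __name__ == "__main__":
    # A 1920x1080 screenshot would be resized to roughly 910x512.
    print(short_side_target(1920, 1080))
```

Setting `detail` to `"low"` is what yields the flat per-image charge; resizing alone only reduces the tile count in high-detail mode.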
Journey Context:
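Developers often assume images are 'flat rate' or cheap compared to text, but vision models tokenize an image by splitting it into patches or tiles. In high-detail mode, GPT-4o first fits the image within 2048x2048, then scales it so the short side is at most 768px, and then charges ~170 tokens per 512x512 tile plus a fixed ~85-token base. A standard 1920x1080 screenshot is therefore scaled to about 1365x768 and processed as 6 tiles \(~1105 tokens\); a 4K screenshot is scaled down to the same size and costs the same. At $5 per million input tokens that is roughly $0.0055 per image, and the exact figure depends on current pricing. In UI automation loops \(e.g., screenshot → action → screenshot\), this can raise costs roughly 10x compared to text-only DOM extraction. The fix is to request 'low' detail mode \(a flat ~85 tokens per image\), to pre-resize aggressively \(a 512px short side limits the image to 1-2 tiles\), or to use SVG/DOM extraction instead of raster screenshots.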
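The tile arithmetic can be sketched as a small estimator. This is an approximation under stated assumptions, not an official calculator: it assumes the published GPT-4o rules \(fit within 2048x2048, shrink the short side to 768px, 170 tokens per 512px tile plus an 85-token base, 85 tokens flat in low detail\), assumes small images are not upscaled, and uses a hypothetical function name `estimate_image_tokens`.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image under the assumed tiling rules."""
    if detail == "low":
        return 85  # flat charge, independent of image size

    # Step 1: fit the image inside a 2048 x 2048 box (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: shrink so the short side is at most 768px (never upscale).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512px tiles and add the fixed base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


if __name__ == "__main__":
    for size in [(1920, 1080), (3840, 2160), (910, 512), (512, 512)]:
        print(size, estimate_image_tokens(*size))
    print("low detail:", estimate_image_tokens(3840, 2160, detail="low"))
```

Under these assumptions a 1920x1080 and a 3840x2160 screenshot both come out at 1105 tokens, a 910x512 resize at 425, and a low-detail request at 85. That is why the detail setting matters more than the resize when the goal is the lowest per-image cost.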
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:33:51.624275+00:00— report_created — created