Report #90468
[cost\_intel] Sending high-res screenshots to vision APIs without resizing causes single images to consume 1100\+ tokens, silently destroying margin on image-heavy workflows
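Pre-resize images to fit within 512x512 px before base64 encoding to ensure single-tile processing \(255 tokens for GPT-4o vision: 85 base \+ 170 for one tile\), or use 'low' detail mode for non-text images. Capping only the shortest side at 768px is not enough for a single tile: a 768px side still spans two 512px tiles, so the image costs at least 765 tokens \(85 \+ 4\*170\).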
Journey Context:
In high-detail mode, GPT-4o vision tiles images into 512x512 patches. A 1920x1080 screenshot is first scaled down to 1365x768 \(shortest side 768px\), then split into a 3x2 grid of 6 tiles = 1105 tokens \(85 base \+ 6\*170\). At $5 per million tokens, that is about $0.0055 per image versus $0.0013 for a single tile \(255 tokens\). At 100k images/day, that is roughly $550/day versus $130/day. The quality degradation on UI automation tasks from resizing to 768px is negligible; text remains readable. The failure mode is OCR on small text such as 8pt fonts, where high resolution matters. Shrinking further to a single 512px tile is a bigger cut, so A/B test your specific image set before committing.
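The arithmetic above can be reproduced with a short sketch. It assumes the tiling rule as described in this report (85 base tokens plus 170 per 512x512 tile, shortest side scaled down to 768px). The 2048px long-side cap and the flat 85-token cost for 'low' detail are assumptions taken from OpenAI's published rule, not from this report, and the function names `estimate_tokens`, `fit_within`, and `cost_usd` are hypothetical helpers, not part of any SDK.

```python
import math

BASE_TOKENS = 85        # flat per-image cost quoted in the report
TOKENS_PER_TILE = 170   # cost per 512x512 tile quoted in the report
TILE = 512
SHORT_SIDE_CAP = 768    # implied by the 1920x1080 -> 1365x768 example
LONG_SIDE_CAP = 2048    # assumption: not stated in the report


def scaled_size(width, height):
    """Apply the two downscaling steps; images are never upscaled here."""
    scale = min(1.0, LONG_SIDE_CAP / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, SHORT_SIDE_CAP / min(w, h))
    return round(w * scale), round(h * scale)


def estimate_tokens(width, height, detail="high"):
    """Estimate image tokens under the tiling rule described above."""
    if detail == "low":
        return BASE_TOKENS  # assumption: 'low' detail is billed at the base cost only
    w, h = scaled_size(width, height)
    tiles = math.ceil(w / TILE) * math.ceil(h / TILE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def fit_within(width, height, max_side=TILE):
    """Target size that fits in max_side x max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def cost_usd(tokens, usd_per_million=5.0):
    """Convert a token count to dollars at the report's $5/million rate."""
    return tokens * usd_per_million / 1_000_000


if __name__ == "__main__":
    full = estimate_tokens(1920, 1080)
    small = estimate_tokens(*fit_within(1920, 1080))
    print(full, small)  # 1105 255
    print(round(cost_usd(full) * 100_000, 2), round(cost_usd(small) * 100_000, 2))
```

The size returned by `fit_within` is what you would pass to your image library's resize call before base64 encoding. Running `estimate_tokens` over a sample of real image dimensions is a cheap way to check the expected bill before changing the pipeline.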
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:26:50.138755+00:00— report_created — created