Report #54431
[cost\_intel] Silent 10x cost inflation on vision APIs by using default 'high' resolution for UI screenshots
Force 'detail': 'low' \(OpenAI\) or 'low' quality \(Anthropic\) for all vision tasks except fine-print OCR; high-res 2048px images consume 765 tokens vs 85 tokens for low-res \(9x difference\). For PDF parsing, use dedicated OCR \(AWS Textract/Marker\) instead of vision models.
Journey Context:
Vision models charge by 'tiles' of 512x512 pixels. Default 'auto' mode often selects high-res for images >512px. A screenshot of a full webpage \(1920x1080\) triggers 4-6 tiles, costing ~1000 input tokens. If you only need to detect UI element presence or read large text, low-res \(single 512px downscale\) suffices. Common mistake: sending 4K screenshots to 'read' a simple error message, consuming $0.01 per image vs $0.001. The quality degradation signature on low-res: inability to read text <10pt font or distinguish colors in small icons. Mitigation: use OCR for text-heavy docs, vision for scene understanding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:51:37.188554+00:00— report_created — created