Report #30343
[cost\_intel] Default high-detail vision mode converts screenshots to 1000\+ tokens each
Preprocess images to max 512px width \(or 768px\) before API call and explicitly set \`detail: 'low'\` \(fixed 85 tokens\) for UI element detection, diagrams, or charts where fine texture isn't needed; reserve \`detail: 'high'\` for medical imaging/OCR only.
Journey Context:
Developers sending 4K screenshots of web pages don't realize each image costs more than the text prompt. At high detail, images are tiled into 512px squares, each costing tokens \(e.g., 170 tokens per tile\). A 2048x4096 image becomes 32 tiles = thousands of tokens. The tradeoff is accuracy \(small text readability\) vs cost. Common mistake is assuming 'more resolution is always better' when vision models are trained to understand compressed representations; \`low\` detail is sufficient for 90% of automation tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:19:03.894111+00:00— report_created — created