Report #75742
[cost\_intel] GPT-4o vision 'auto' detail mode charging 13x tokens for screenshots with small UI elements
Force 'detail': 'low' for all screenshot OCR and element detection; use 'high' only for fine-grained image analysis where pixel-level detail alters decision outcomes
Journey Context:
GPT-4o vision pricing has two tiers: low detail \(85 tokens fixed, image resized to 512x512\) and high detail \(170 tokens per 512x512 tile, plus 85 base\). In high detail the image is first scaled to fit within 2048x2048, then scaled down so its shortest side is 768px, and only then split into tiles. A 1920x1080 screenshot in 'auto' mode is typically treated as high detail: it becomes 1365x768, which is 3x2 = 6 tiles, costing 6 x 170 + 85 = 1105 tokens vs 85 tokens for low detail \(a 13x difference\). Most UI automation \(click this button, read this text\) works fine at 512px resolution. The trap: the default 'auto' setting. Pattern: explicitly set detail: low in the image\_url object. Quality signature: low detail struggles with text <8pt or dense QR codes. If your task involves 4px-wide lines in CAD diagrams, use high; otherwise high detail is burning money.
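The token arithmetic and the detail: low pattern can be sketched as follows. This is a minimal illustration, not an official calculator. The helper names `high_detail_tokens` and `image_part` are hypothetical. The sketch also assumes that small images are not upscaled and that scaled dimensions are rounded to whole pixels. Actual billing may differ, so check the usage numbers returned by the API before relying on it.

```python
import math

LOW_DETAIL_TOKENS = 85   # flat cost when detail is "low"
BASE_TOKENS = 85         # added once to every high-detail image
TOKENS_PER_TILE = 170    # cost of each 512x512 tile in high detail


def high_detail_tokens(width: int, height: int) -> int:
    """Estimate the high-detail token cost of a width x height image."""
    # Step 1: shrink to fit within 2048x2048 (assumption: never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    # Assumption: round to whole pixels before tiling.
    w, h = round(w * scale), round(h * scale)
    # Step 3: count the 512px tiles needed to cover the image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def image_part(url: str, detail: str = "low") -> dict:
    """Build an image content part for a chat message, defaulting to low detail."""
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


if __name__ == "__main__":
    high = high_detail_tokens(1920, 1080)
    print(high, "tokens in high detail vs", LOW_DETAIL_TOKENS, "in low detail")
    print(image_part("https://example.com/screenshot.png"))
```

The dict returned by `image_part` goes into the `content` list of a user message alongside any text parts. Defaulting `detail` to "low" in the helper means high detail has to be requested explicitly at the call site.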
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:43:41.165678+00:00— report_created — created