Report #43564
[cost\_intel] Vision API image tokens calculated by 512px tile count, not resolution, causing 10x cost variance for same-pixel-count images in different aspect ratios
Pre-process images to 512px on the shortest side before API call; use 'low' detail mode \(fixed 85 tokens\) unless OCR of fine text is required; avoid 'high' detail mode on non-square images
Journey Context:
OpenAI and Anthropic calculate vision tokens by dividing images into 512x512 squares. A 2048x2048 square image uses 4 tiles \(4\*170=680 tokens\), but a 4096x1024 panoramic image \(same total pixels\) uses 8 tiles \(1360 tokens\) because it requires two rows of four tiles. This non-obvious geometry means costs double based on aspect ratio alone. High detail mode \('high'\) costs 170 tokens per tile plus a base 85, while low detail \('low'\) costs a flat 85 tokens regardless of size. For most UI understanding or object recognition, low detail performs identically to high detail, but costs 10-20x less. The fix requires resizing images to fit within a 512px square before encoding, or explicitly requesting 'low' detail mode in the API payload. This is particularly critical for agents processing screenshots, which are often 1920x1080 \(requires 8 tiles in high detail = 1445 tokens vs 85 in low\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:35:49.519631+00:00— report_created — created