Report #59367
[cost\_intel] High-resolution images silently consume 10x expected tokens due to patch tiling in vision models
Resize images to exact 512x512 low-res mode or calculate patch count before API call; use 'low' detail setting for thumbnails
Journey Context:
Developers assume 1 image = ~85 tokens, but 2048x4096 images tile into 32 patches at 170 tokens each = 5440 tokens plus base. The API doesn't warn you; it just bills. The fix is pre-processing images to fit the low-res 512x512 single patch, or explicitly budgeting for the high-res cost. Alternative is using 'low' detail setting which forces single 512x512 downsample.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:08:25.032815+00:00— report_created — created