Report #31279
[cost\_intel] Ignoring output token costs in vision models for high-resolution images
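Calculate total cost = input\_tokens \(image tiles \+ text\) × input price \+ output\_tokens × output price. GPT-4o bills high-detail images by 512x512 tile after downscaling, and Gemini applies its own per-image or per-tile token rule, so image input tokens depend on resolution rather than being a flat per-image fee. Under OpenAI's documented rule, a 2048x2048 image is downscaled to 768x768 and billed as 4 tiles \(765 tokens\), not 16. High-resolution vision tasks also need output budget planning, because output tokens are usually priced higher than input tokens: generate terse outputs, or use JSON mode with a small set of fields and a max output token limit.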
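A minimal sketch of the cost formula, weighting input and output tokens by separate prices. The function name, the token counts, and both prices are illustrative placeholders, not a real rate card; substitute the current rates for the model you use.

```python
def request_cost_usd(image_tokens, text_tokens, output_tokens,
                     input_price_per_mtok, output_price_per_mtok):
    """Estimate the cost of one vision request in USD.

    Prices are expressed in USD per million tokens.
    """
    input_tokens = image_tokens + text_tokens
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical prices (USD per million tokens), for illustration only.
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00

# Same image and prompt, verbose description vs. terse JSON answer.
verbose = request_cost_usd(1_000, 50, 500, INPUT_PRICE, OUTPUT_PRICE)
terse = request_cost_usd(1_000, 50, 60, INPUT_PRICE, OUTPUT_PRICE)

print(f"verbose: ${verbose:.6f}  terse: ${terse:.6f}")
```

With these placeholder prices, the 500-token description accounts for most of the cost of the verbose request, which is why capping output length matters as much as reducing image resolution.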
Journey Context:
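Vision pricing is often quoted per image or per 1k tokens, but the token count of an image depends on its resolution and on the provider's tiling rule. As documented by OpenAI for GPT-4o, a high-detail image is first scaled to fit within 2048x2048, then scaled so its shortest side is 768 px, and then split into 512x512 tiles at 170 tokens each plus a fixed 85 tokens. A 2048x2048 image therefore becomes 768x768 = 4 tiles = 765 input tokens, not 16 tiles; low-detail mode is a flat 85 tokens. If the model then generates a 500-token description, the request totals about 1.3k tokens, and because output tokens are usually priced several times higher than input tokens, the description can cost more than the image. At scale \(video frames\), this cost is paid on every frame. Common error: budgeting only for text input and ignoring image tokenization and verbose outputs. Mitigation: use low-res when possible, force JSON output with limited fields and a max output token cap, or use a cheaper tier such as Gemini Flash, which has lower image input costs.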
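The image token arithmetic can be sketched as below. This assumes the high-detail rule OpenAI has published for GPT-4o (fit within 2048x2048, shortest side scaled to 768 px, 512 px tiles at 170 tokens each, plus 85 base tokens). The function name is hypothetical, the no-upscaling behavior for small images is an assumption, and other models or providers use different constants, so treat the result as an estimate and compare it with the usage figures the API returns.

```python
import math


def gpt4o_high_detail_tokens(width, height, tile=512, per_tile=170, base=85):
    """Estimate input tokens for one high-detail image (assumed GPT-4o rule)."""
    # Step 1: shrink to fit within a 2048x2048 square (assumed: never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512 px tiles needed to cover the resized image.
    tiles = math.ceil(w / tile) * math.ceil(h / tile)
    return base + per_tile * tiles


# Per-frame budget for a video job: image tokens plus a 500-token description.
frames = 1_000
image_tokens = gpt4o_high_detail_tokens(2048, 2048)
total_tokens = frames * (image_tokens + 500)
print(image_tokens, total_tokens)
```

Under these assumptions the output side (500 tokens per frame) is of the same order as the image side, so trimming each response to a few JSON fields cuts the total token count noticeably before any price difference is applied.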
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:53:22.428628+00:00— report_created — created