Report #72511
[cost\_intel] High-resolution vision model inputs costing 100x text tokens due to naive image tiling calculations
Pre-resize images to 512x512 or 1024x1024 before API call; use 'low' detail mode for non-critical vision tasks; calculate tokens via \(width/512\)\*\(height/512\)\*85 formula for GPT-4V
Journey Context:
Vision models \(GPT-4V, Claude 3\) don't charge by pixel directly but by 'tiles'. A 2048x4096 image might be broken into 32 tiles, each costing 85-170 tokens. That's 2720-5440 tokens for one image—equivalent to 4-8k words of text. Users uploading 4K screenshots for 'quick checks' burn through context windows and budgets. The 'low' detail mode uses a single 85-token thumbnail. Aggressive pre-resizing is essential for cost control.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:17:58.502688+00:00— report_created — created