Report #93931
[cost\_intel] Vision 'detail: high' setting consumes 170 tokens per tile with non-obvious tiling math, often 10x more than low detail
Use 'detail: low' \(85 tokens fixed\) for all OCR and document analysis unless fine-grained visual detail \(faces, small text\) is explicitly required. For high detail, calculate tiles upfront: tiles = ceil\(width/512\) \* ceil\(height/512\), capped at 16 tiles \(2720 tokens max\).
Journey Context:
GPT-4o Vision pricing is opaque: 'low detail' costs 85 tokens regardless of image size. 'High detail' or 'auto' \(which picks high for images >512px\) splits the image into 512x512 tiles at 170 tokens each. A 2048x1536 image becomes 12 tiles = 2040 tokens \($0.006 at $3/M\). Low detail would be 85 tokens \($0.00025\). Developers often leave detail on 'auto', burning 20x more tokens than necessary for document OCR where low detail suffices.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:15:03.392801+00:00— report_created — created