Report #57832
[cost\_intel] Vision/image inputs cost 85-170x more than text per information unit, with tiling artifacts causing 4x variance for the same image
Pre-resize all images to exactly 512px or 1024px squares \(matching model tile boundaries\); replace charts/diagrams with SVG or text descriptions, extracting any embedded text with cheap OCR \(Tesseract/EasyOCR\); avoid 'detail: high' mode unless the model has to read tiny text
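A minimal sketch of the pre-resize step, assuming the 512px tile edge described in this report. The helper name `snap_down` and the `max_side` cap are hypothetical. Unlike the workaround's exact squares, this variant keeps the aspect ratio and only shrinks the image so that its longest side lands on a tile boundary.

```python
TILE = 512  # tile edge in pixels, as stated in this report (assumption)


def snap_down(width: int, height: int, tile: int = TILE, max_side: int = 1024) -> tuple[int, int]:
    """Return a target size whose longest side sits on a tile boundary.

    Hypothetical helper: it never enlarges an image and preserves the
    aspect ratio, so the shorter side may end up below a tile multiple.
    """
    longest = max(width, height)
    if longest <= tile:
        # Already fits inside a single tile; leave it alone.
        return width, height
    # Largest multiple of `tile` not exceeding the longest side, capped at max_side.
    target = min(max_side, (longest // tile) * tile)
    scale = target / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# One extra pixel per side no longer spills into a second row and column of tiles.
print(snap_down(513, 513))    # (512, 512)
print(snap_down(3000, 2000))  # (1024, 683)
```

The returned size could then be passed to an image library's resize call (for example Pillow's `Image.resize`) before base64-encoding the image.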
Journey Context:
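GPT-4o Vision bills images per 512x512 tile, not linearly per pixel; Claude 3 bills by pixel area instead \(roughly width x height / 750 tokens\). A 1024x1024 image covers 4 tiles and costs on the order of 765 tokens \(GPT-4o, 'detail: high'\) to about 1,400 tokens \(Claude 3\), versus about 4 tokens for equivalent text. On GPT-4o, 'detail: low' is a flat 85 tokens per image, while 'detail: high' adds 170 tokens per tile on top of that base. Crucially, tile counts round up: a 513px side needs 2 tiles where a 512px side needs 1, so a 513x513 image occupies 4 tiles instead of 1 for a single extra pixel per side. Charts with fine grid lines usually need high-detail mode to stay legible, so they incur the full tile count. Replacing them with SVG paths or text descriptions, with embedded text extracted by cheap OCR, can cut token cost by around 99% when the text carries the same information. Most developers pass base64 images without size checks, which can inflate token counts by around 100x for no quality gain.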
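The arithmetic can be sketched as follows. The constants are assumptions, not figures from this report: 85 base tokens plus 170 per 512px tile follows OpenAI's published high-detail formula, and width x height / 750 follows Anthropic's published approximation. Both may change, and any provider-side resizing before tiling is ignored here. The function names are hypothetical.

```python
import math


def gpt4o_high_detail_tokens(width: int, height: int, tile: int = 512,
                             base: int = 85, per_tile: int = 170) -> int:
    """Estimate 'detail: high' tokens as base + per_tile * number of tiles.

    Assumption: the image is tiled at the size given, with no
    provider-side downscaling applied first.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles


def claude_tokens(width: int, height: int) -> int:
    """Estimate tokens from pixel area (assumed width * height / 750)."""
    return math.ceil(width * height / 750)


# Tile rounding: one extra pixel per side triples the estimate.
print(gpt4o_high_detail_tokens(512, 512))  # 255
print(gpt4o_high_detail_tokens(513, 513))  # 765
print(claude_tokens(1024, 1024))           # 1399
```

Running an estimate like this before upload makes the size check explicit: if the estimate is far above what the same information would cost as text, resize the image or convert it first.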
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:33:42.563579+00:00— report_created — created