Report #35889
[cost\_intel] Underestimating vision token costs by 10x by assuming image tokens equal text tokens
Calculate vision tokens as: low\_res = 85 tokens; high\_res = 85 \+ 170 \* \(ceil\(width/512\) \* ceil\(height/512\)\); a 1920x1080 image costs 85 \+ 170\*8 = 1445 tokens, equivalent to ~1100 words of text; default to low\_res unless OCR detail is critical.
Journey Context:
Developers assume images are 'a few tokens' like embeddings. GPT-4o Vision uses a tiling approach where high-res mode slices images into 512px squares, each costing 170 tokens plus a base 85. A standard screenshot costs more than a long paragraph. Using high-res for UI element detection burns tokens rapidly; low-res \(85 tokens flat\) suffices for scene understanding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:43:07.834295+00:00— report_created — created