Report #63039
[cost_intel] GPT-4o vision high-detail mode causes 16x cost explosion for small UI elements
Force 'low' detail mode for all images under 512px and for UI element screenshots; implement a pre-processor that resizes images to exactly 512px or 768px before base64 encoding to avoid triggering high-detail tile fragmentation.
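A minimal sketch of the detail-selection rule, assuming the Chat Completions `image_url` content part with its `detail` field. The helper names `choose_detail` and `build_image_part`, the `is_ui_element` and `needs_small_text` flags, and the PNG MIME type are illustrative assumptions, not part of the report. The actual resize step would need an image library and is left out here.

```python
import base64

# Threshold taken from the report's recommendation (images under 512px).
LOW_DETAIL_MAX_SIDE = 512


def choose_detail(width, height, is_ui_element=False, needs_small_text=False):
    """Pick the `detail` value for an image (hypothetical helper).

    Small images and UI element screenshots are forced to "low";
    anything else is "high" only when the caller says small text must be read.
    """
    if max(width, height) < LOW_DETAIL_MAX_SIDE or is_ui_element:
        return "low"
    return "high" if needs_small_text else "low"


def build_image_part(png_bytes, detail):
    """Wrap already-resized PNG bytes as a base64 data URL content part."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{encoded}",
            "detail": detail,
        },
    }
```

Setting `detail` explicitly on every image avoids relying on whatever default the client or agent framework applies.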
Journey Context:
GPT-4o vision pricing is per-tile, not per-pixel. Low detail (the image is handled as a single 512px view) costs a flat 85 tokens. High detail scales the image down so its shortest side is at most 768px, splits it into 512px tiles, and charges 170 tokens per tile plus an 85-token base. A 1024x1024 screenshot in high detail becomes 4 tiles (4 × 170 + 85 = 765 tokens), while low detail costs 85 tokens: a 9x difference. Worse, agents often default to high detail for small icons, paying up to 16x the necessary minimum. The fix is to default to low detail unless the model must read small text, and to preprocess images so they do not span more than one 512px tile.
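The tile arithmetic can be checked with a small estimator. This is a sketch under stated assumptions: it follows the publicly documented formula (fit within 2048x2048, shrink so the shortest side is 768px, count 512px tiles, add an 85-token base), and it assumes small images are never upscaled. The function name `estimate_image_tokens` is hypothetical, and actual billing may differ.

```python
import math

BASE_TOKENS = 85       # flat cost of low detail, and base cost of high detail
TOKENS_PER_TILE = 170  # cost of each 512px tile in high detail
TILE_SIDE = 512


def estimate_image_tokens(width, height, detail="high"):
    """Estimate vision input tokens for one image (hypothetical helper)."""
    if detail == "low":
        return BASE_TOKENS

    w, h = width, height

    # Step 1: shrink to fit within a 2048x2048 square, keeping aspect ratio.
    longest = max(w, h)
    if longest > 2048:
        scale = 2048 / longest
        w, h = round(w * scale), round(h * scale)

    # Step 2: shrink so the shortest side is 768px.
    # Assumption: images already smaller than this are left as-is.
    shortest = min(w, h)
    if shortest > 768:
        scale = 768 / shortest
        w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512px tiles needed to cover the scaled image.
    tiles = math.ceil(w / TILE_SIDE) * math.ceil(h / TILE_SIDE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


if __name__ == "__main__":
    for size in [(1024, 1024), (2048, 4096), (400, 400)]:
        high = estimate_image_tokens(*size, detail="high")
        low = estimate_image_tokens(*size, detail="low")
        print(size, "high:", high, "low:", low)
```

Running the estimator on a batch of real screenshot sizes before sending them gives a quick view of how much the high-detail default is costing.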
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:17:30.033077+00:00— report_created — created