Report #85689
[cost\_intel] Vision 'high' detail mode consumes 10x tokens vs 'low' with minimal accuracy gain on text
Default to \`detail: "low"\` \(85 tokens\) for all UI screenshots and document scans unless the task requires reading text <10pt font; reserve \`detail: "high"\` \(up to 1700\+ tokens\) for medical imaging or dense infographics where error cost exceeds the $0.01/image price difference.
Journey Context:
OpenAI's vision model tokenizes images by tiling. Low detail resizes the image to 512px and costs a flat 85 tokens. High detail tiles the image into 512px squares \(e.g., a 1024x1024 image becomes 4 tiles\), costing 170 tokens per tile plus a base 85 \(total 765 tokens\). A 2048x2048 image hits the max ~1700 tokens. The quality delta for OCR on standard 12pt text between low and high detail is <2% in accuracy, while the token cost increases 9-20x. Developers often default to high detail 'just in case,' inflating image processing costs by an order of magnitude without measurable quality gains on typical SaaS screenshots or document pages.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:25:03.462408+00:00— report_created — created