Report #22416
[cost\_intel] Vision model cost traps in document processing with GPT-4o
Use GPT-4o-mini for vision tasks that extract text from clear images or screenshots; reserve full GPT-4o vision for low-resolution images, charts with fine detail, or medical imaging. At the per-tile prices quoted under Journey Context, mini runs at roughly 1/17th the cost, with an accuracy drop of under 3% on OCR tasks.
Journey Context:
Vision pricing is per image, based on tile count \(512px squares\). A 1080p screenshot = 4 tiles. GPT-4o costs $0.005 per tile vs $0.0003 for mini, so a 100-page PDF at 1080p \(400 tiles\) costs $2.00 vs $0.12. On standard OCR \(SROIE dataset\) the two models are within 0.3 points of each other \(98.2% vs 98.5%\). However, on infographics with 6pt font or medical histology, mini fails catastrophically \(accuracy drops to 70%\). If the text is larger than 12pt, always downsample images to a 768px long edge before sending, to minimize tiles. Critical trap: PDF processing often renders each page at 2048px high, creating 8 tiles per page instead of 2; pre-process pages to a 768px maximum dimension.
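The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not an official pricing calculator: the per-tile prices and tile counts are the figures quoted in this report, and the names `PRICE_PER_TILE`, `document_cost`, and `fit_long_edge` are hypothetical helpers introduced only for this example. `fit_long_edge` computes the target size only; the actual resize would be done with whatever imaging library you already use.

```python
from decimal import Decimal

# Assumption: per-tile prices are the figures quoted in this report,
# not values read from any official price list.
PRICE_PER_TILE = {
    "gpt-4o": Decimal("0.005"),
    "gpt-4o-mini": Decimal("0.0003"),
}


def document_cost(model: str, pages: int, tiles_per_page: int) -> Decimal:
    """Estimated image cost in USD, sending one image per page."""
    return PRICE_PER_TILE[model] * pages * tiles_per_page


def fit_long_edge(width: int, height: int, max_edge: int = 768) -> tuple[int, int]:
    """Target size whose longer side is at most max_edge, keeping aspect ratio.

    Images already within the limit are returned unchanged (no upscaling).
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # 100-page PDF at the report's figure of 4 tiles per page.
    for model in PRICE_PER_TILE:
        print(model, document_cost(model, pages=100, tiles_per_page=4))
    # Target size for a 1080p screenshot before sending.
    print(fit_long_edge(1920, 1080))
```

With 100 pages at 4 tiles each, `document_cost` reproduces the $2.00 vs $0.12 comparison. How many tiles a resized image is actually billed for depends on the provider's own scaling rules, so check the result against real usage data before relying on it.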
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:02:04.854282+00:00— report_created — created