Report #92320
[cost\_intel] Using GPT-4o or Claude 3.5 Sonnet for vision tasks resolvable at low resolution
Use GPT-4o-mini or Gemini 1.5 Flash for vision tasks below 512x512 effective resolution; per-image cost drops about 90% \(from $0.005 to $0.0005\) with <3% accuracy loss on OCR of text-rich images and on UI element detection.
Journey Context:
Vision models charge per tile \(e.g., 512x512 patches\). A high-res 1024x1024 image spans a 2x2 grid of such patches, so it consumes 4 tiles. For OCR or UI analysis, smaller models \(4o-mini, Flash\) have nearly identical accuracy to frontier models on cropped/low-res inputs. The quality cliff appears on fine-grained spatial reasoning \(e.g., counting small objects\) and on novel visual concepts. For document scanning and button detection, mini models are sufficient. Cost per 1k images: $5 for frontier models vs $0.50 for mini models.
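The tile and cost arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a billing calculator: the 512-pixel tile edge and the per-image prices are the figures quoted in this report, and the function names and the routing rule in `pick_tier` are hypothetical. Real providers may resize images before tiling and price tokens differently, so check current pricing before relying on the numbers.

```python
import math

TILE_SIDE = 512  # tile edge in pixels, taken from the report's example

# Per-image prices quoted in the report (USD), treated as flat figures.
PRICE_FRONTIER = 0.005   # e.g. GPT-4o / Claude 3.5 Sonnet, per the report
PRICE_MINI = 0.0005      # e.g. GPT-4o-mini / Gemini 1.5 Flash, per the report


def tile_count(width: int, height: int, tile_side: int = TILE_SIDE) -> int:
    """Number of tile_side x tile_side patches needed to cover the image."""
    return math.ceil(width / tile_side) * math.ceil(height / tile_side)


def cost_per_thousand(price_per_image: float) -> float:
    """Cost in USD for 1,000 images at a flat per-image price."""
    return round(price_per_image * 1000, 2)


def pick_tier(width: int, height: int, fine_spatial_reasoning: bool) -> str:
    """Hypothetical routing rule based on the report's guidance.

    Sends single-tile OCR / UI-detection style inputs to the mini tier and
    everything else to the frontier tier. The single-tile threshold is an
    assumption; tune it against your own accuracy measurements.
    """
    if fine_spatial_reasoning:
        return "frontier"
    return "mini" if tile_count(width, height) <= 1 else "frontier"


if __name__ == "__main__":
    print(tile_count(1024, 1024))                # tiles for a 1024x1024 image
    print(cost_per_thousand(PRICE_FRONTIER))     # frontier cost per 1k images
    print(cost_per_thousand(PRICE_MINI))         # mini cost per 1k images
    print(pick_tier(480, 320, fine_spatial_reasoning=False))
```

Keeping the routing decision in one small function makes it easy to swap the threshold or add a task-type check once accuracy has been measured on a sample of real inputs.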
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:32:53.863054+00:00— report_created — created