Report #67655
[cost_intel] Vision model tiering: Claude 3 Haiku vision vs GPT-4V for UI element detection
Use Claude 3 Haiku with vision for UI element presence detection and screenshot classification. It comes within about two points of GPT-4V accuracy on binary visual tasks (>95% vs 97%) at roughly 1/8th the cost ($0.00125 vs $0.0105 per image), but fails on tasks that combine OCR with reasoning.
Journey Context:
GPT-4V costs $0.005-$0.015 per image depending on resolution (low/high). Haiku vision costs a fixed $0.00125 per image. For tasks such as 'does this screenshot contain a login button' or 'classify this UI as mobile vs desktop', Haiku achieves >95% accuracy against GPT-4V's 97%. The cliff appears when the task requires reading text and then reasoning about it (e.g., 'extract the error message and suggest a fix'): Haiku's OCR accuracy drops significantly on small fonts (<12px), and its reasoning about the extracted text fails. Teams overpay for GPT-4V on pure visual classification pipelines where Haiku suffices.
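The tiering rule described above can be sketched as a small router. This is a minimal illustration, not a tested production policy: the task labels, the `Task` fields, and the tier names `"haiku"` and `"gpt-4v"` are hypothetical placeholders (not API model identifiers), and the prices are the per-image figures quoted in this report rather than values read from any pricing API.

```python
from dataclasses import dataclass

# Per-image prices in USD, copied from this report (assumed, not fetched live).
HAIKU_PER_IMAGE = 0.00125
GPT4V_PER_IMAGE_LOW = 0.005   # low-resolution setting
GPT4V_PER_IMAGE_HIGH = 0.015  # high-resolution setting

# Hypothetical labels for the pure visual tasks the report says Haiku handles.
VISUAL_ONLY_TASKS = {"element_presence", "screenshot_classification"}


@dataclass(frozen=True)
class Task:
    kind: str                 # hypothetical task label
    reads_text: bool = False  # True if the task depends on OCR of on-screen text


def pick_tier(task: Task) -> str:
    """Route pure visual tasks to the cheap tier; everything else goes up."""
    if task.kind in VISUAL_ONLY_TASKS and not task.reads_text:
        return "haiku"
    # Conservative default: anything involving text, or any unknown task kind,
    # stays on the more expensive model.
    return "gpt-4v"


def batch_cost(images: int, per_image: float) -> float:
    """Total cost in USD for a batch of images at a flat per-image price."""
    return images * per_image


if __name__ == "__main__":
    print(pick_tier(Task("element_presence")))
    print(pick_tier(Task("error_extraction", reads_text=True)))
    print(batch_cost(100_000, HAIKU_PER_IMAGE))
    print(batch_cost(100_000, GPT4V_PER_IMAGE_LOW))
    print(batch_cost(100_000, GPT4V_PER_IMAGE_HIGH))
```

The router defaults to the expensive tier for anything it does not recognise, so a mislabelled task costs money rather than accuracy. At the quoted prices the per-image ratio ranges from about 4x (low resolution) to about 12x (high resolution).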
Lifecycle
2026-06-20T20:02:20.504062+00:00— report_created — created