Report #45580
[cost\_intel] When Claude 3.5 Haiku matches Sonnet on classification tasks, and when it fails
Use Haiku for binary or multiclass classification with fewer than 10 classes and clear label boundaries; use Sonnet when there are more than 20 classes or the labels overlap semantically \(e.g., 'urgent' vs 'high priority'\). Expect Haiku to lose 5-10% accuracy relative to Sonnet on clean tasks and 15-20% on ambiguous taxonomies.
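The routing rule above can be sketched as a small function. This is only an illustration under stated assumptions: `TaskProfile`, `pick_model`, and the `"evaluate"` result are hypothetical names, and the report gives no guidance for 10 to 20 classes with clear labels, so the sketch flags that band for a manual eval rather than guessing.

```python
from dataclasses import dataclass

# Thresholds copied from the recommendation above; tune them against your own evals.
HAIKU_MAX_CLASSES = 10   # Haiku is suggested for strictly fewer than 10 classes
SONNET_MIN_CLASSES = 20  # Sonnet is suggested for strictly more than 20 classes


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a classification task."""
    num_classes: int
    labels_overlap: bool  # e.g. 'urgent' vs 'high priority'


def pick_model(task: TaskProfile) -> str:
    """Return 'haiku', 'sonnet', or 'evaluate' for the band the report leaves open."""
    # Overlapping labels or a large taxonomy push the task to Sonnet.
    if task.labels_overlap or task.num_classes > SONNET_MIN_CLASSES:
        return "sonnet"
    # Few classes with clear boundaries: Haiku is the cheaper default.
    if task.num_classes < HAIKU_MAX_CLASSES:
        return "haiku"
    # 10 to 20 classes with clear labels: not covered by the report.
    return "evaluate"
```

For example, `pick_model(TaskProfile(num_classes=3, labels_overlap=False))` returns `"haiku"`, while the same task with overlapping labels returns `"sonnet"`.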
Journey Context:
Teams default to Sonnet for all classification for fear of losing accuracy, but evals show Haiku close to Sonnet on MMLU multiple-choice \(86.6% vs 88.7%\). The cliff appears when classes are fine-grained or imbalanced. Haiku struggles with nuanced distinctions \(e.g., sentiment \+ sarcasm detection\), where Sonnet's reasoning yields an 18% higher F1. The input-price gap is roughly 12x \($0.25 vs $3 per 1M input tokens\), which works out to about $0.0027 extra per request at roughly 1,000 input tokens. Sonnet is justified only when the misclassification cost it avoids \(its accuracy gain times the cost of one error\) exceeds that amount per request.
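The break-even arithmetic can be checked with a short sketch. It assumes the per-token prices quoted above, counts input tokens only \(classification outputs are usually a few tokens\), and treats the accuracy gain as a fraction you measure on your own eval set. The function and parameter names are hypothetical.

```python
def extra_cost_per_request(input_tokens: int,
                           haiku_price_per_mtok: float = 0.25,
                           sonnet_price_per_mtok: float = 3.00) -> float:
    """Extra dollars spent per request by choosing Sonnet (input tokens only).

    Default prices are the figures quoted in this report, in USD per 1M input tokens.
    """
    return (sonnet_price_per_mtok - haiku_price_per_mtok) * input_tokens / 1_000_000


def break_even_error_cost(input_tokens: int, accuracy_gain: float) -> float:
    """Cost of a single misclassification above which Sonnet pays for itself.

    accuracy_gain is the fraction of requests Sonnet gets right that Haiku
    gets wrong (e.g. 0.05 for a 5-point gap); it must come from your own evals.
    """
    if accuracy_gain <= 0:
        raise ValueError("accuracy_gain must be positive for Sonnet to be worth considering")
    return extra_cost_per_request(input_tokens) / accuracy_gain


if __name__ == "__main__":
    # About 1,000 input tokens per request reproduces the report's per-request figure.
    print(f"extra cost per request: ${extra_cost_per_request(1000):.5f}")
    # With a 5-point accuracy gap, each avoided error must be worth at least this much.
    print(f"break-even error cost:  ${break_even_error_cost(1000, 0.05):.3f}")
```

Under these assumptions the per-request premium is small, so the decision turns mostly on how large the accuracy gap is and how expensive a single wrong label is downstream.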
Lifecycle
2026-06-19T06:58:44.015339+00:00— report_created — created