Report #44383
[cost_intel] Using frontier models for straightforward single-label classification where smaller models match within 2-5% at 12x lower cost
Default to Haiku 3.5 or Gemini Flash for classification tasks with clear category boundaries. Reserve frontier models only for classification requiring nuance resolution, sarcasm detection, or cross-referencing multiple criteria simultaneously.
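The routing rule above can be written as a default-small, escalate-on-exception check. This is a minimal sketch: `ClassificationTask`, its flags, and `choose_tier` are hypothetical names, and the tier strings are placeholders you would map to real model IDs yourself.

```python
from dataclasses import dataclass

# Placeholder tier names (hypothetical); map them to your provider's model IDs.
SMALL_TIER = "small"        # e.g. Haiku 3.5 or Gemini Flash, per this report
FRONTIER_TIER = "frontier"  # e.g. Sonnet or GPT-4o, per this report


@dataclass
class ClassificationTask:
    """Hypothetical description of a single-label classification task."""
    name: str
    needs_nuance_resolution: bool = False
    needs_sarcasm_detection: bool = False
    multi_criteria: bool = False  # label depends on several criteria at once


def choose_tier(task: ClassificationTask) -> str:
    """Default to the small tier; escalate only for the cases the report lists."""
    if (task.needs_nuance_resolution
            or task.needs_sarcasm_detection
            or task.multi_criteria):
        return FRONTIER_TIER
    return SMALL_TIER


print(choose_tier(ClassificationTask("ticket_topic")))
print(choose_tier(ClassificationTask("review_sentiment", needs_sarcasm_detection=True)))
```

Keeping the small tier as the fallthrough case means a new task gets the cheap model unless someone explicitly marks it as needing one of the listed capabilities.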
Journey Context:
On standard classification benchmarks (sentiment analysis, topic categorization, intent detection), Haiku 3.5 and Gemini Flash achieve F1 within 2-5% of Sonnet and GPT-4o. At $0.25/M input tokens (Haiku) vs $3/M (Sonnet), that is a 12x reduction in input cost for negligible quality loss on well-defined categories. The degradation cliff comes when classification requires resolving ambiguity, such as assigning an overall sentiment to "the food was great but the service was terrible," or detecting sarcasm: there, smaller models lose 15-25% accuracy. The practical test: if human annotators would agree on the label 95%+ of the time given the same input, use a small model; if annotators would disagree frequently, use a frontier model.
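The annotator-agreement test and the cost ratio above can be checked with a few lines of code. This is a sketch under stated assumptions: `pairwise_agreement`, `recommend_tier`, and `input_cost_ratio` are hypothetical helpers, agreement here is raw pairwise observed agreement (not a chance-corrected statistic such as Cohen's or Fleiss' kappa), and the prices are the figures quoted in this report, which may not match current provider pricing.

```python
from itertools import combinations

# Input prices in USD per million tokens, as quoted in this report (may be outdated).
SMALL_INPUT_PRICE = 0.25     # Haiku
FRONTIER_INPUT_PRICE = 3.00  # Sonnet

AGREEMENT_THRESHOLD = 0.95  # the report's "95%+" rule of thumb


def pairwise_agreement(labels_per_item: list[list[str]]) -> float:
    """Fraction of annotator pairs, across all items, that chose the same label.

    labels_per_item[i] holds the labels different annotators gave item i.
    """
    agree = 0
    total = 0
    for labels in labels_per_item:
        for a, b in combinations(labels, 2):
            total += 1
            agree += a == b
    if total == 0:
        raise ValueError("need at least two labels on at least one item")
    return agree / total


def recommend_tier(labels_per_item: list[list[str]]) -> str:
    """High annotator agreement -> small model; otherwise frontier model."""
    if pairwise_agreement(labels_per_item) >= AGREEMENT_THRESHOLD:
        return "small"
    return "frontier"


def input_cost_ratio() -> float:
    """How many times more the frontier model costs per input token."""
    return FRONTIER_INPUT_PRICE / SMALL_INPUT_PRICE


clear = [["billing", "billing", "billing"]] * 20
mixed = [["positive", "negative", "positive"], ["negative", "negative", "negative"]]
print(recommend_tier(clear))  # small
print(recommend_tier(mixed))  # frontier
print(input_cost_ratio())     # 12.0
```

In the `mixed` sample only 4 of 6 annotator pairs agree (about 67%), well under the 95% bar, so the sketch routes it to the frontier tier.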
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:58:05.345505+00:00— report_created — created