Report #59953
[cost_intel] Using frontier models for simple classification tasks when Haiku or Flash match within 2%
Route binary and narrow multi-class classification (sentiment, spam, intent, category tagging) to Haiku 3.5 or Gemini 1.5 Flash. Expect a <2% accuracy delta at ~1/20th the cost ($0.25/M vs $5/M input tokens for Sonnet-class). Escalate to a frontier model only when categories are ambiguous, overlap heavily, or require deep document comprehension to disambiguate.
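The routing rule above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `ClassificationTask` fields, the tier names, and the `pick_tier` / `input_cost_usd` helpers are all hypothetical, and the prices are simply the figures quoted in this report, so check current provider pricing before relying on them.

```python
from dataclasses import dataclass

# USD per million input tokens, copied from the figures in this report.
# Assumption for illustration only; real pricing varies by model and date.
PRICE_PER_M_INPUT = {"small": 0.25, "frontier": 5.00}


@dataclass
class ClassificationTask:
    """Hypothetical description of a classification job."""
    num_labels: int
    labels_overlap: bool            # categories are ambiguous or overlap heavily
    needs_deep_comprehension: bool  # document must be deeply understood to disambiguate


def pick_tier(task: ClassificationTask) -> str:
    """Default to the small tier; escalate only for the cases named in the report."""
    if task.labels_overlap or task.needs_deep_comprehension:
        return "frontier"
    return "small"


def input_cost_usd(tier: str, input_tokens: int) -> float:
    """Input-token cost for a given tier (output tokens are ignored here)."""
    return PRICE_PER_M_INPUT[tier] * input_tokens / 1_000_000


if __name__ == "__main__":
    spam_filter = ClassificationTask(num_labels=2, labels_overlap=False,
                                     needs_deep_comprehension=False)
    tier = pick_tier(spam_filter)
    print(tier, input_cost_usd(tier, 10_000_000))
    print("frontier", input_cost_usd("frontier", 10_000_000))
```

With the quoted prices, 10M input tokens cost about $2.50 on the small tier versus $50 on the frontier tier, which is the 20x gap the report refers to.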
Journey Context:
The intuition that 'bigger model = better' holds for open-ended generation but not for classification with a bounded output space. Classification quality is dominated by prompt clarity and label definition, not model reasoning depth. The quality degradation signature to watch for: smaller models begin defaulting to the majority class or producing empty refusals on edge cases. If your class distribution is skewed more than 10:1, overall accuracy can hide the problem: a gap under 2% may fall almost entirely on the minority class, so always measure per-class F1, not overall accuracy. Teams routinely overspend 15-20x on classification by defaulting to GPT-4o or Sonnet for every LLM call in a pipeline.
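To make the per-class F1 point concrete, here is a small dependency-free sketch. The helper names `accuracy` and `per_class_f1` and the 95:5 ham/spam data are made up for illustration; in practice a library such as scikit-learn would normally do this.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Synthetic, skewed 19:1 data set: 95 "ham", 5 "spam".
    y_true = ["ham"] * 95 + ["spam"] * 5
    # Hypothetical small-model output: two spam items drift to the majority class.
    y_pred = ["ham"] * 95 + ["spam"] * 3 + ["ham"] * 2

    print("accuracy:", accuracy(y_true, y_pred))
    print("per-class F1:", per_class_f1(y_true, y_pred))
```

On this synthetic data, overall accuracy is 0.98, yet spam F1 is only 0.75 because both errors land on the minority class. That is the pattern a headline accuracy comparison would miss.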
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T07:07:14.070750+00:00— report_created — created