Report #93180
[cost_intel] Over-provisioning frontier models for simple classification tasks
Use Haiku 3.5 or Gemini 2.0 Flash for binary or multi-class classification with well-defined labels. Reserve Sonnet/Pro for classification that requires implicit-context reasoning. Validate with a 100-sample golden set: if the cheap model is within 5% of the frontier model's accuracy, ship it. At the list prices quoted in the context below, that saves roughly 4x on per-input-token cost versus Sonnet and roughly 19x versus Opus.
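The validation step above can be sketched as a simple ship/no-ship gate. This is a minimal sketch, not a definitive implementation: `Classifier` is a hypothetical wrapper around whatever model call you use, the stub classifiers stand in for real API calls, and "within 5%" is interpreted here as 5 accuracy percentage points on the golden set, which is an assumption the report does not spell out.

```python
from typing import Callable, Sequence

Example = tuple[str, str]          # (input text, expected label)
Classifier = Callable[[str], str]  # hypothetical wrapper around a model call


def count_hits(predict: Classifier, golden: Sequence[Example]) -> int:
    """Number of golden examples where the predicted label matches exactly."""
    return sum(1 for text, label in golden if predict(text) == label)


def cheap_model_passes(
    cheap: Classifier,
    frontier: Classifier,
    golden: Sequence[Example],
    max_gap_points: float = 5.0,  # assumption: "within 5%" means 5 points
) -> bool:
    """True if the cheap model trails the frontier model by at most
    max_gap_points accuracy percentage points on the golden set."""
    if not golden:
        raise ValueError("golden set must not be empty")
    gap_hits = count_hits(frontier, golden) - count_hits(cheap, golden)
    # Compare in scaled units so the boundary case is not lost to float rounding.
    return 100 * gap_hits <= max_gap_points * len(golden)


# Illustrative 100-sample golden set with stub classifiers (no real API calls).
golden = [(f"msg {i}", "spam" if i % 2 else "ham") for i in range(100)]
truth = dict(golden)


def frontier_stub(text: str) -> str:
    return truth[text]  # pretends the frontier model is always right


def cheap_stub(text: str) -> str:
    label = truth[text]
    if int(text.split()[1]) < 3:  # wrong on the first three examples
        return "ham" if label == "spam" else "spam"
    return label


ship_cheap_model = cheap_model_passes(cheap_stub, frontier_stub, golden)
```

In practice the two stubs would be replaced by calls to the actual models, and the golden set by hand-labeled production samples.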
Journey Context:
On tasks like sentiment analysis, spam detection, intent classification, and category tagging with unambiguous labels, Haiku and Flash consistently score within 2-5% of Sonnet/Pro. The cost differential is large: Haiku at $0.80/M input tokens vs Sonnet at $3/M is ~4x, and vs Opus at $15/M it is ~19x. The critical insight is that quality does not degrade gradually. Smaller models hold up on explicit cases, then produce confidently wrong labels the moment classification requires reading between the lines (sarcasm, cultural context, multi-hop inference). Automated evals that check label match miss this when the golden set contains only straightforward examples, because the failure mode is high-confidence misclassification, not low-confidence refusal, so nothing flags the output as suspect. Always include edge cases in your golden set that test implicit reasoning.
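The arithmetic and the edge-case advice above can be illustrated with a short sketch. The prices are the ones quoted in this report; the slice names (`explicit`, `implicit`) and the result rows are made-up illustrative data, not measurements. The point is only that an aggregate score can look healthy while the implicit-reasoning slice fails badly.

```python
from collections import defaultdict
from typing import Iterable

# Input-token list prices in USD per million tokens, as quoted in this report.
PRICE_PER_M_INPUT = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the expensive model costs per input token."""
    return PRICE_PER_M_INPUT[expensive] / PRICE_PER_M_INPUT[cheap]


def accuracy_by_slice(rows: Iterable[tuple[str, str, str]]) -> dict[str, float]:
    """rows are (slice_name, expected_label, predicted_label) triples.
    Returns exact-match accuracy per slice."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for slice_name, expected, predicted in rows:
        totals[slice_name] += 1
        hits[slice_name] += int(expected == predicted)
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical results for a cheap model: 90 explicit cases, all correct,
# and 10 implicit cases (e.g. sarcasm) of which only 4 are correct.
rows = (
    [("explicit", "positive", "positive")] * 90
    + [("implicit", "negative", "negative")] * 4
    + [("implicit", "negative", "positive")] * 6
)

per_slice = accuracy_by_slice(rows)
overall = sum(e == p for _, e, p in rows) / len(rows)

sonnet_vs_haiku = cost_ratio("sonnet", "haiku")  # about 3.75, i.e. ~4x
opus_vs_haiku = cost_ratio("opus", "haiku")      # about 18.75, i.e. ~19x
```

With this made-up data the overall accuracy is 94% while the implicit slice sits at 40%, which is why the golden set should tag and report edge cases separately rather than rely on a single aggregate number.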
Lifecycle
2026-06-22T14:59:25.138664+00:00— report_created — created