Report #65376
[cost\_intel] Using frontier models for simple classification tasks where small models match quality
Use Haiku/Flash/GPT-4o-mini for binary or multi-class classification with well-defined, non-overlapping categories. Expect 10-20x cost reduction with <2% quality delta vs Sonnet/Pro/GPT-4o.
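The cost side of this recommendation is simple arithmetic, sketched below. All figures are illustrative assumptions rather than numbers from this report or a provider price sheet: `TOKENS_PER_ITEM`, `SMALL_PRICE`, and `FRONTIER_PRICE` are hypothetical placeholders, and output tokens are ignored, since a classifier that returns a single label emits very few.

```python
def classification_cost(n_items: int, tokens_per_item: int,
                        price_per_million_tokens: float) -> float:
    """Return the input-token cost in dollars for classifying n_items."""
    return n_items * tokens_per_item * price_per_million_tokens / 1_000_000


# Assumed, illustrative values; replace with your own measurements and
# the current published rates for the models you are comparing.
N_ITEMS = 1_000_000
TOKENS_PER_ITEM = 200     # prompt template plus item text (hypothetical)
SMALL_PRICE = 0.25        # $ per 1M input tokens (placeholder)
FRONTIER_PRICE = 3.00     # $ per 1M input tokens (placeholder)

small = classification_cost(N_ITEMS, TOKENS_PER_ITEM, SMALL_PRICE)
frontier = classification_cost(N_ITEMS, TOKENS_PER_ITEM, FRONTIER_PRICE)
ratio = frontier / small

print(f"small: ${small:,.2f}  frontier: ${frontier:,.2f}  ratio: {ratio:.1f}x")
```

Because cost is linear in the per-token rate, the ratio depends only on the two prices. Item count and prompt length change the absolute dollar amounts but not the multiple.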
Journey Context:
On sentiment analysis, spam detection, topic routing, and intent classification with clear label sets, small models consistently score within 1-3 F1 points of frontier models. The quality cliff appears when categories are ambiguous, overlapping, or require deep domain context to distinguish. Degradation signatures to watch for: the small model invents categories not in your label set, labels edge cases inconsistently where a domain expert would not, or ignores implicit context in the input. If your classification requires reading between the lines \(e.g., detecting sarcasm, subtle safety violations, or domain-specific jargon\), stay on frontier models. For everything else, the savings are large because cost scales linearly with the per-token rate: classifying 1M items at a small-model input rate of roughly $0.15-$0.30/M tokens \(Haiku-class\) versus about $3/M \(Sonnet\) gives a 10-20x total cost difference at scale. Exact rates vary by model version, so check current pricing before committing.
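One way to watch for the degradation signatures described above is to audit a sample of small-model outputs against labels from a frontier model or a domain expert. The following is a minimal sketch, not a method from this report: `ALLOWED_LABELS`, `audit_predictions`, and the sample data are all hypothetical names and values, and the predictions are assumed to have been collected already as plain strings.

```python
from collections import Counter

# Hypothetical label set; substitute your own.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def audit_predictions(small_preds, reference_preds, allowed=ALLOWED_LABELS):
    """Compare small-model labels with reference labels on a sample.

    Reports labels outside the allowed set (invented categories) and the
    fraction of items where the small model disagrees with the reference.
    """
    if len(small_preds) != len(reference_preds):
        raise ValueError("prediction lists must have the same length")

    # Normalize case and whitespace so formatting noise is not counted.
    normalized = [p.strip().lower() for p in small_preds]
    reference = [r.strip().lower() for r in reference_preds]

    invented = Counter(p for p in normalized if p not in allowed)
    disagreements = sum(1 for s, r in zip(normalized, reference) if s != r)
    n = len(normalized)

    return {
        "invented_labels": dict(invented),
        "invented_rate": sum(invented.values()) / n if n else 0.0,
        "disagreement_rate": disagreements / n if n else 0.0,
    }


# Toy sample: the small model invents "mixed" and misses one other item.
sample_small = ["positive", "Negative", "mixed", "neutral"]
sample_ref = ["positive", "negative", "negative", "positive"]
report = audit_predictions(sample_small, sample_ref)
print(report)
```

A nonzero `invented_rate`, or a `disagreement_rate` concentrated in edge cases, suggests the task has crossed the quality cliff and may belong on a frontier model. What counts as an acceptable rate is a judgment call for your own application.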
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:13:07.031356+00:00— report_created — created