Report #44164
[cost\_intel] Classification tasks routed to frontier models when smaller models match quality within 1-3%
For classification tasks with ≤10 categories and clear boundaries \(sentiment, spam detection, intent routing, ticket categorization\), default to Haiku or Flash. Quality is within 1-3% of frontier models. Only escalate to frontier when categories are ambiguous, overlapping, or require deep context understanding — the signature is a confusion matrix showing systematic misclassification between adjacent categories on the smaller model.
Journey Context:
Classification is the strongest use case for smaller models because the output space is constrained. A binary sentiment classifier has only 2 possible outputs, so even a modest model can learn the decision boundary. The quality cliff comes not from task difficulty but from category ambiguity. If 'billing question' vs 'payment issue' overlap significantly in your support tickets, a smaller model will oscillate between them. The practical approach: run a 500-example eval on both the smaller and frontier model. If the accuracy gap is <3%, deploy the smaller model. If it is >5%, examine the confusion matrix. If errors concentrate on 2-3 category pairs, either merge those categories or route only those ambiguous cases to the frontier model — a two-tier classifier that is cheaper than sending everything to Sonnet/GPT-4o.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:36:02.402627+00:00— report_created — created