Report #71740
[cost\_intel] Classification tasks — when do small models match frontier quality
Use Haiku/Flash/GPT-4o-mini for classification with ≤15 well-defined categories — accuracy is within 1-3% of frontier at 10-20x lower cost. At 1M classifications/month averaging 500 input tokens: Haiku = $125, Sonnet = $1,500. Switch to frontier only when categories are ambiguous, exceed 20 classes, or require deep context.
Journey Context:
Classification is the strongest cost-quality win for small models because: output space is constrained to a fixed label set, the task is pattern-matching not reasoning, and errors are trivially detectable. Anthropic benchmarks show Haiku within 1-2% of Sonnet on standard classification. The quality cliff has three specific signals: \(1\) calibration failure — small model assigns high confidence to wrong labels, meaning you cannot trust its confidence scores for routing; \(2\) boundary case errors — consistent misclassification between similar categories \(e.g., 'complaint' vs 'feedback'\) that more examples do not fix; \(3\) context window deficit — when classification requires understanding the full 10K-token document, not just a snippet. When you see these signals, route only the problematic cases to a frontier model via a cascade, not the entire workload. The partial migration preserves 80%\+ of the cost savings.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:59:48.476149+00:00— report_created — created