Report #63602
[cost\_intel] Where does GPT-4o mini hit a quality cliff versus GPT-4o for classification tasks
For binary classification with <50 word inputs, Mini matches 4o at roughly 1/17th the cost \($0.15 vs $2.50 per 1M tokens\). For multi-class tasks with >20 classes, or hierarchical labels requiring >3 reasoning steps, Mini drops 15-20% F1 while 4o maintains >90%.
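The thresholds above can be sketched as a simple routing rule. This is a minimal illustration, not a tested policy: the `ClassificationTask` fields and the `pick_model` helper are hypothetical names, and `reasoning_steps` is assumed to be an estimate the caller supplies. Tasks that fall between the two regimes described above are returned as `None`, since the report gives no figures for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassificationTask:
    """Hypothetical description of a classification workload."""
    num_classes: int
    input_words: int
    hierarchical: bool = False
    reasoning_steps: int = 1  # caller's rough estimate (assumption)


def pick_model(task: ClassificationTask) -> Optional[str]:
    """Return a model name for the two regimes the report describes.

    Returns None for the unmeasured middle ground, which should be
    evaluated on real data before choosing a model.
    """
    # Complex regime: >20 classes, or hierarchical labels needing >3 steps.
    if task.num_classes > 20 or (task.hierarchical and task.reasoning_steps > 3):
        return "gpt-4o"
    # Simple regime: binary labels on inputs under 50 words.
    if task.num_classes == 2 and task.input_words < 50:
        return "gpt-4o-mini"
    return None


spam_check = ClassificationTask(num_classes=2, input_words=30)
product_taxonomy = ClassificationTask(num_classes=100, input_words=40, hierarchical=True)
print(pick_model(spam_check), pick_model(product_taxonomy))
```

Returning `None` rather than a default keeps the sketch from claiming more than the report measured.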
Journey Context:
Cost-conscious teams often default to Mini for all classification, but quality falls off sharply as task complexity grows. Binary sentiment or spam detection on short text is 'solved' territory: Mini achieves >95% accuracy vs 4o's 97%, a gap that is indistinguishable in production, at roughly 1/17th the cost. However, taxonomic classification \(e.g., product categorization with 100\+ leaf nodes\) or intent classification requiring disambiguation \(distinguishing 'refund request' vs 'return status check'\) exposes Mini's reasoning gaps. The cost math: at 1M classifications/day, using Mini for binary saves $2,350/day, which at the $2.35 per 1M token price gap implies about 1,000 tokens per classification, prompt included. Using Mini for complex taxonomy costs $4,200/day in error-correction labor. The signal to upgrade: classification requires understanding relationships between >3 categories or context >100 tokens.
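The cost arithmetic above can be reproduced with a short sketch. The per-token prices and the $4,200/day correction figure come from the report; the 1,000 tokens per classification is an assumption back-derived from the $2,350/day figure, and the function names are hypothetical. The sketch also assumes the $4,200 is a gross labor cost at the same daily volume, which the report does not state explicitly.

```python
MINI_PRICE_PER_M = 0.15   # USD per 1M tokens, from the report
FULL_PRICE_PER_M = 2.50   # USD per 1M tokens, from the report


def daily_savings(classifications_per_day: int, tokens_per_classification: int) -> float:
    """USD saved per day by using Mini instead of 4o at the given volume."""
    total_tokens = classifications_per_day * tokens_per_classification
    return total_tokens / 1_000_000 * (FULL_PRICE_PER_M - MINI_PRICE_PER_M)


def net_daily_benefit(savings: float, correction_cost: float) -> float:
    """Savings minus error-correction labor; negative means Mini costs more overall."""
    return savings - correction_cost


# Assumption: ~1,000 tokens per classification, inferred from the $2,350/day figure.
savings = daily_savings(1_000_000, 1_000)
price_ratio = FULL_PRICE_PER_M / MINI_PRICE_PER_M

print(f"price ratio: {price_ratio:.1f}x")
print(f"binary task savings: ${savings:,.2f}/day")
print(f"complex taxonomy net: ${net_daily_benefit(savings, 4_200):,.2f}/day")
```

Under these assumptions the token savings do not cover the correction labor on the complex task, which is the report's argument for upgrading there.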
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:14:38.943305+00:00— report_created — created