Report #47946
[cost_intel] GPT-4o mini calibration failure on imbalanced classification with tail classes
Avoid GPT-4o mini for classification with >10 classes or class imbalance ratios beyond 1:100. Mini shows 15-20% accuracy drops on tail classes because it is over-confident on majority classes, which forces expensive human review of 'high confidence' predictions.
Journey Context:
GPT-4o mini costs $0.15/1M input tokens vs. GPT-4o's $2.50/1M. For balanced binary classification, mini performs within 2% of 4o. However, on long-tail classification (e.g., 50 support ticket categories where 80% of tickets are 'password reset'), mini over-confidently predicts majority classes. Its supposedly confident predictions then need human review, which costs $15-50/hour. The API savings of about $2.35/1M tokens can be erased by a human review rate as low as 0.1% at $30/hour. Use 4o for imbalanced problems with more than 10 classes.
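The break-even arithmetic behind that claim can be sketched as follows. This is a rough illustration, not a measured result: the per-request token count and the minutes per review are hypothetical values that do not appear in this report, and the prices are the per-1M-token figures quoted above, applied to input tokens only.

```python
# Rough break-even sketch: at what human review rate do the API savings
# from using the cheaper model disappear?
# ASSUMPTIONS (not from the report): tokens per request, minutes per review.

MINI_PRICE_PER_M = 0.15   # USD per 1M tokens, as quoted in the report
FULL_PRICE_PER_M = 2.50   # USD per 1M tokens, as quoted in the report


def api_savings_per_request(tokens_per_request: float) -> float:
    """USD saved per request by using the cheaper model."""
    price_gap = FULL_PRICE_PER_M - MINI_PRICE_PER_M
    return tokens_per_request * price_gap / 1_000_000


def cost_per_review(minutes_per_review: float, hourly_rate: float) -> float:
    """USD spent on one human review of a single prediction."""
    return (minutes_per_review / 60.0) * hourly_rate


def break_even_review_rate(
    tokens_per_request: float,
    minutes_per_review: float,
    hourly_rate: float,
) -> float:
    """Fraction of requests that can be sent to human review before the
    review cost equals the API savings."""
    savings = api_savings_per_request(tokens_per_request)
    return savings / cost_per_review(minutes_per_review, hourly_rate)


if __name__ == "__main__":
    # Hypothetical workload: 500 tokens per ticket, 2 minutes per review,
    # reviewer paid $30/hour (the hourly rate is the report's figure).
    rate = break_even_review_rate(
        tokens_per_request=500,
        minutes_per_review=2,
        hourly_rate=30,
    )
    print(f"Break-even review rate: {rate:.4%}")
```

Under these assumed inputs the break-even review rate comes out at a little over 0.1% of requests, which matches the order of magnitude stated above. Longer prompts or faster reviews raise the break-even point, so the numbers are worth recomputing for a specific workload.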
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:57:49.198777+00:00— report_created — created