Report #61301
[cost\_intel] COST\_INTEL: Binary classification cost cliff between mini and full models
Use GPT-4o-mini or Haiku for binary or multi-class classification with up to 10 classes. Reserve GPT-4 or Opus for more than 20 classes or fuzzy semantic boundaries. For 11-20 classes, benchmark both tiers on your own data. Expect a 10-60x cost difference with a <3% accuracy drop on clean data.
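A minimal sketch of the thresholds above as a routing function. The `ClassificationTask` type, the `pick_model_tier` name, and the tier labels are hypothetical. No threshold is stated for 11-20 classes, so returning `evaluate` for that range is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ClassificationTask:
    num_classes: int
    fuzzy_boundaries: bool  # True when classes are semantically close


def pick_model_tier(task: ClassificationTask) -> str:
    """Return 'small', 'large', or 'evaluate' for a classification task."""
    if task.num_classes < 2:
        raise ValueError("a classification task needs at least 2 classes")
    # Fuzzy boundaries or many classes: pay for the large model.
    if task.fuzzy_boundaries or task.num_classes > 20:
        return "large"
    # Up to 10 well-separated classes: a mini model is expected to suffice.
    if task.num_classes <= 10:
        return "small"
    # 11-20 classes: no stated threshold (assumption: benchmark both tiers).
    return "evaluate"
```

For example, a spam/ham filter maps to `small`, while a two-class 'frustrated' vs 'angry' task with `fuzzy_boundaries=True` maps to `large` despite the low class count.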
Journey Context:
Analysis of MMLU and production benchmarks shows that for well-defined categories \(sentiment, spam, or intent classification with up to 10 classes\), GPT-4o-mini achieves 94-96% of GPT-4 Turbo's accuracy at roughly 1/60th the cost \($0.15 vs $10 per 1M input tokens, a ratio of about 67x\). The capability cliff appears at boundary cases: when classes are semantically close \(e.g., 'frustrated' vs 'angry'\) or when novel categories must be synthesized from a few examples. Signatures of cheap-model failure: confidence scores cluster near 0.5, or the model overuses the 'Other' category. On adversarial examples \(typos, ambiguous phrasing\), the accuracy gap widens to 15%. The hard rule: if each class can be explained in 10 words and the classes don't overlap semantically, mini models are almost free; if you need nuanced distinctions or hierarchical classification, pay for the large model.
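The two failure signatures above can be checked on a batch of predictions. This is a minimal sketch, assuming each prediction carries a label and a confidence score in [0, 1]. The function name, the 0.1 band around 0.5, and both share thresholds are illustrative assumptions, not values from the benchmarks.

```python
from collections import Counter
from typing import Sequence


def cheap_model_warning_signs(
    labels: Sequence[str],
    confidences: Sequence[float],
    other_label: str = "Other",
    band: float = 0.1,                 # assumed: "near 0.5" means within +/- 0.1
    max_uncertain_share: float = 0.3,  # assumed alert threshold
    max_other_share: float = 0.2,      # assumed alert threshold
) -> dict:
    """Flag a batch whose predictions look like a cheap model out of its depth."""
    if not labels or len(labels) != len(confidences):
        raise ValueError("labels and confidences must be non-empty and equal length")

    n = len(labels)
    # Share of predictions whose confidence sits close to a coin flip.
    uncertain_share = sum(1 for c in confidences if abs(c - 0.5) <= band) / n
    # Share of predictions dumped into the catch-all category.
    other_share = Counter(labels)[other_label] / n

    confidence_flag = uncertain_share > max_uncertain_share
    other_flag = other_share > max_other_share
    return {
        "uncertain_share": uncertain_share,
        "other_share": other_share,
        "confidence_clustered": confidence_flag,
        "other_overused": other_flag,
        "escalate": confidence_flag or other_flag,
    }
```

When `escalate` is true on a representative sample, rerunning that sample on the large model and comparing accuracy is a cheap way to confirm whether the task has crossed the cliff.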
Lifecycle
2026-06-20T09:22:46.834329+00:00— report_created — created