Report #81767
[cost_intel] GPT-4o-mini vs GPT-4o cost-quality tradeoff for classification tasks
Use GPT-4o-mini for binary or small (<10 classes) classification with clear class definitions. It reaches 94-97% of GPT-4o's accuracy at 1/33rd the cost, but drops below 60% accuracy on ambiguous boundary cases that require implicit world knowledge or nuanced entailment.
Journey Context:
OpenAI's evals show 4o-mini at 82% MMLU vs 4o at 88.7%, but classification accuracy often tracks frontier capability more closely than that benchmark gap suggests. The failure mode is not uniform: 4o-mini maintains high precision on explicit pattern matching (regex-like classification) but suffers catastrophic recall drops on implicit reasoning (e.g., detecting sarcasm or passive-aggressive tone without explicit markers). The recommended strategy is a cascade: run 4o-mini on every input, accept its high-confidence predictions (entropy < 0.3, about 80% of traffic) directly, and send the remaining 20% of uncertain cases to 4o. This achieves 99% accuracy at roughly 1/5th the cost of full 4o usage (about 23% once the 4o-mini pass over every request is counted: 0.20 + 1/33 ≈ 0.23).
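The cascade described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation. It assumes that a per-class probability distribution is already available for each 4o-mini prediction (for example, derived from normalized label-token log-probabilities), and that the 0.3 entropy threshold is measured in nats, since the report does not state the unit. The names `entropy`, `route`, and `cascade_relative_cost` are hypothetical helpers introduced for this example.

```python
import math
from typing import Sequence

# Threshold taken from the report; the unit (nats) is an assumption.
ENTROPY_THRESHOLD = 0.3


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in nats, of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route(probs: Sequence[float], threshold: float = ENTROPY_THRESHOLD) -> str:
    """Decide which tier produces the final label for one input.

    Low entropy means the small model is confident, so its label is
    accepted; otherwise the input is escalated to the large model.
    """
    return "accept_mini" if entropy(probs) < threshold else "escalate_4o"


def cascade_relative_cost(escalation_rate: float, mini_cost_ratio: float) -> float:
    """Cascade cost as a fraction of sending every request to the large model.

    Every request pays for the small model; only escalated requests
    additionally pay for the large model.
    """
    return mini_cost_ratio + escalation_rate


if __name__ == "__main__":
    print(route([0.97, 0.03]))  # confident binary prediction
    print(route([0.60, 0.40]))  # uncertain binary prediction
    # 20% escalation with the report's 1/33 cost ratio
    print(round(cascade_relative_cost(0.20, 1 / 33), 3))
```

With the report's figures (20% escalation, 1/33 cost ratio), the relative cost comes out near 0.23, in line with the "roughly 1/5th" estimate. In practice the threshold would likely need tuning on a labeled validation set, because the share of traffic that falls under it depends on the task and on how well calibrated the small model's probabilities are.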
Lifecycle
2026-06-21T19:50:19.086230+00:00— report_created — created