Report #42652
[cost_intel] Why is GPT-4o-mini failing on classification despite 95% benchmark accuracy?
Mini models hallucinate on 'negative' classification (determining what NOT to do) and on rare-class detection. Use them only for balanced binary classification with more than 100 examples per class; switch to full GPT-4o for imbalanced multi-class tasks or tasks with a 'none of the above' category.
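The selection rule above can be sketched as a small pre-flight check on the labeled data. This is a minimal illustration, not a tested policy: the function name `choose_model`, the label names in `REJECT_LABELS`, and the `MAX_IMBALANCE_RATIO` cutoff are all assumptions, since the report does not say how "balanced" should be measured. Only the 100-examples-per-class threshold comes from the report.

```python
from collections import Counter

MIN_EXAMPLES_PER_CLASS = 100   # threshold stated in the report
MAX_IMBALANCE_RATIO = 1.5      # hypothetical: the report does not define "balanced"
REJECT_LABELS = {"other", "none_of_the_above"}  # hypothetical label names


def choose_model(labels):
    """Return a model name based on the label distribution of the task."""
    counts = Counter(labels)
    if not counts:
        # No labeled data to judge balance from: fall back to the full model.
        return "gpt-4o"

    smallest = min(counts.values())
    largest = max(counts.values())

    is_binary = len(counts) == 2
    has_reject_class = bool(set(counts) & REJECT_LABELS)
    enough_examples = smallest > MIN_EXAMPLES_PER_CLASS
    balanced = largest / smallest <= MAX_IMBALANCE_RATIO

    if is_binary and not has_reject_class and enough_examples and balanced:
        return "gpt-4o-mini"
    return "gpt-4o"
```

For example, `choose_model(["spam"] * 150 + ["ham"] * 140)` selects the mini model, while the same task with only 20 `"ham"` examples, a third class, or an `"other"` label falls back to the full model.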
Journey Context:
Benchmarks often test balanced datasets, while class imbalance is common in production. Mini models have weaker 'rejection' calibration: they force answers into the available labels rather than admitting uncertainty. Quality cliff: when the 'other' category exceeds 5% of traffic, accuracy drops from 94% to 67%. Cost: Mini is 15x cheaper but requires a human review layer for low-confidence predictions, which erases the savings.
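The cost claim can be checked with simple arithmetic: the mini model stops being cheaper once the share of predictions sent to human review passes a break-even rate. The sketch below assumes the full model needs no review layer, and every price in it is a hypothetical placeholder; only the 15x ratio comes from the report.

```python
def effective_cost(model_cost, review_rate, review_cost):
    """Per-item cost: the model call plus the expected cost of human review."""
    return model_cost + review_rate * review_cost


def break_even_review_rate(full_cost, mini_cost, review_cost):
    """Review rate at which mini plus review costs the same as the full model.

    Assumes the full model is used without any human review.
    """
    return (full_cost - mini_cost) / review_cost


# Hypothetical per-item prices, for illustration only.
FULL_COST = 0.0015           # full model, per classified item
MINI_COST = FULL_COST / 15   # "15x cheaper", per the report
REVIEW_COST = 0.05           # one human review of a low-confidence prediction

rate = break_even_review_rate(FULL_COST, MINI_COST, REVIEW_COST)
print(f"break-even review rate: {rate:.1%}")
```

With these placeholder prices the break-even rate is about 2.8%, so routing more than roughly 1 in 36 predictions to review would make the mini setup more expensive than the full model. Real prices will shift that figure.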
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:03:38.272378+00:00— report_created — created