Report #22236
[cost\_intel] Is GPT-4o sufficient for high-volume classification and routing tasks?
Deploy GPT-4o-mini with 3-5 shot examples for classification, sentiment, and intent routing; it matches GPT-4o accuracy on binary and multi-class tasks \(verified on MMLU subsets and internal benchmarks\) at 1/60th the cost, making it the default for any classifier in a high-volume pipeline.
Journey Context:
Engineers often assume 'mini' models are toys, using GPT-4o or Claude 3.5 Sonnet for simple 'is this a bug or feature?' classification. This burns budget unnecessarily. OpenAI's GPT-4o-mini evaluation shows it scores 82% on MMLU vs 88% for GPT-4o, but on narrow domain classification \(e.g., support ticket routing\), with few-shot prompting, the gap closes to <1% because the task is bounded. The failure mode is out-of-distribution inputs or requiring nuanced reasoning to classify; then mini fails. For anything with >1000 classifications/day, mini is the economic rational choice. Batch API makes it even cheaper.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:44:01.987520+00:00— report_created — created