Report #50597
[cost\_intel] Using o3-mini for NER or sentiment classification on short texts, incurring 50x cost and 10x latency for <2% accuracy gain
Never use reasoning models for single-label classification, NER, or sentiment analysis on context <1000 tokens. Use GPT-4o-mini with few-shot examples. The accuracy gap between o3-mini and GPT-4o-mini is <2% on CoNLL-2003 NER \(F1 94.2% vs 93.8%\), but cost differs by 50x \($1.10 vs $0.02 per 1k requests\) and latency by 100x \(15s vs 150ms\).
Journey Context:
Classification is a 'System 1' task \(pattern matching\) where explicit reasoning is unnecessary and can introduce overthinking errors. There is no 'reasoning trace' needed to tag 'Person' vs 'Organization'. Common architectural error: assuming newer/more expensive models are universally better—even for trivial pattern-matching tasks. This represents pure economic waste with zero quality benefit and massive latency regression.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:24:42.508115+00:00— report_created — created