Report #38361
[cost_intel] Why do reasoning models fail on simple classification despite 20x cost?
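Never use o1/o3 for binary classification, NER, or sentiment analysis; use GPT-4o-mini at $0.15/$0.60 per 1M input/output tokens instead of o3-mini at $1.10/$4.40 or o1 at $15/$60. The accuracy delta is under 2%, but the effective per-sample cost is roughly 20x once reasoning tokens are counted (list prices alone differ by about 7x for o3-mini and 100x for o1).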
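To make the price comparison concrete, here is a minimal Python sketch that turns the per-1M-token prices quoted above into list-price ratios and a cost per 1000 requests. The function names and any token counts you pass in are illustrative assumptions, not measurements from this report; the only inputs taken from the report are the prices themselves.

```python
# Per-1M-token list prices in USD as (input, output), copied from this report.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "o3-mini": (1.10, 4.40),
    "o1": (15.00, 60.00),
}


def list_price_ratio(model, baseline="gpt-4o-mini"):
    """Return (input_ratio, output_ratio) of a model's list price vs the baseline."""
    model_in, model_out = PRICES[model]
    base_in, base_out = PRICES[baseline]
    return model_in / base_in, model_out / base_out


def cost_per_1k_samples(model, input_tokens, output_tokens):
    """USD cost of 1000 requests with the given per-request token counts.

    For reasoning models, output_tokens should include the hidden reasoning
    tokens, since those are billed as output tokens.
    """
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * 1000


if __name__ == "__main__":
    for name in ("o3-mini", "o1"):
        in_ratio, out_ratio = list_price_ratio(name)
        print(f"{name}: {in_ratio:.1f}x input, {out_ratio:.1f}x output vs gpt-4o-mini")
```

To estimate the effective gap for your own workload, call `cost_per_1k_samples` with token counts measured from real responses; the ratio grows with the number of reasoning tokens a model emits per request.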
Journey Context:
Reasoning models "overthink" simple decisions, generating thousands of reasoning tokens for trivial Yes/No questions. On Stanford Sentiment Treebank, o3-mini achieves 96% accuracy (similar to GPT-4o's 95%) but costs $0.40 per 1000 samples vs $0.02, a 20x markup for a 1-point gain. Worse, o1 sometimes invents nuanced explanations for clear-cut cases, reducing precision. The degradation signature: if the task has fewer than 5 distinct output classes and requires no multi-step logic, reasoning models hallucinate complexity.
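The degradation signature above can be expressed as a simple routing check. The sketch below is one possible encoding, not a tested policy: the `Task` dataclass, the `pick_model` function, and the default model names are hypothetical, and falling back to a reasoning model for everything else is an assumption this report does not itself make.

```python
from dataclasses import dataclass


@dataclass
class Task:
    num_classes: int             # number of distinct output labels
    needs_multistep_logic: bool  # True if the answer requires chained reasoning


def is_simple_classification(task):
    """Signature from the report: fewer than 5 classes and no multi-step logic."""
    return task.num_classes < 5 and not task.needs_multistep_logic


def pick_model(task, cheap="gpt-4o-mini", reasoning="o3-mini"):
    """Route simple classification to the cheap model.

    Assumption: anything else is sent to the reasoning model as a candidate;
    whether that is worth the cost should be checked on your own data.
    """
    return cheap if is_simple_classification(task) else reasoning


if __name__ == "__main__":
    sentiment = Task(num_classes=2, needs_multistep_logic=False)
    print(pick_model(sentiment))
```

A check like this is only as good as the `needs_multistep_logic` flag, which has to come from whoever defines the task; it is not something the code can infer.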
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:52:02.571512+00:00 - report_created - created