Report #46095
[cost_intel] When do reasoning models hurt accuracy on simple classification tasks?
Avoid reasoning models for binary classification, sentiment analysis, or NER with fewer than 10 classes. GPT-4o-mini or Claude 3.5 Haiku achieve >95% F1 at $0.01-0.10 per 1k tokens, while o1 costs $3-15 per 1k tokens and often 'overthinks' simple labels, which adds label noise.
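To make the cost gap concrete, here is a minimal sketch of the arithmetic. The per-1k-token price ranges are copied from this report and are not independently verified. The workload size (1,000,000 items at 200 tokens each) and the helper names `batch_cost` and `cost_range` are hypothetical. The flat-price model also ignores any extra reasoning tokens a reasoning model may emit.

```python
from typing import Tuple

# Price ranges in dollars per 1k tokens, as quoted in this report (unverified).
CHEAP_PRICE_PER_1K = (0.01, 0.10)       # GPT-4o-mini / Claude 3.5 Haiku
REASONING_PRICE_PER_1K = (3.00, 15.00)  # o1


def batch_cost(num_items: int, tokens_per_item: int, price_per_1k: float) -> float:
    """Dollar cost of processing num_items at a flat per-1k-token price."""
    total_tokens = num_items * tokens_per_item
    return total_tokens / 1000 * price_per_1k


def cost_range(num_items: int, tokens_per_item: int,
               price_range: Tuple[float, float]) -> Tuple[float, float]:
    """Low and high cost estimates for a (low, high) price range."""
    low, high = price_range
    return (batch_cost(num_items, tokens_per_item, low),
            batch_cost(num_items, tokens_per_item, high))


if __name__ == "__main__":
    # Hypothetical workload: 1,000,000 short texts, about 200 tokens each.
    items, tokens = 1_000_000, 200
    cheap = cost_range(items, tokens, CHEAP_PRICE_PER_1K)
    reasoning = cost_range(items, tokens, REASONING_PRICE_PER_1K)
    print(f"cheap model:     ${cheap[0]:,.0f} to ${cheap[1]:,.0f}")
    print(f"reasoning model: ${reasoning[0]:,.0f} to ${reasoning[1]:,.0f}")
```

Substituting your provider's current published prices and your measured token counts should give a more reliable estimate than the ranges above.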
Journey Context:
Reasoning models are optimized for complex multi-step logic. On simple classification, their lengthy chain-of-thought can hallucinate spurious distinctions or second-guess obvious labels. On SST-2 (sentiment), GPT-4o is ~97% accurate; o1 is ~96% but costs 60x more. Cheap models hit a 'cliff' when input templates change adversarially; in that case reasoning models adapt without retraining. Common mistake: using o1 for high-volume, low-variance ETL pipelines where a regex or GPT-4o-mini suffices. The degradation signature is 'overthinking': the model produces paragraphs justifying a simple positive/negative label.
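The routing idea and the 'overthinking' signature can be sketched as below. This is an illustration under stated assumptions, not a tested pipeline. The keyword lists, the 5-word threshold, and the names `regex_label`, `looks_overthought`, and `classify` are all hypothetical. `cheap_model` stands in for whatever callable wraps your inexpensive model; no real provider API is used.

```python
import re
from typing import Callable, Optional

# Illustrative keyword patterns only; a real pipeline would need a vetted list.
POSITIVE = re.compile(r"\b(great|excellent|love|loved|wonderful)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(terrible|awful|hate|hated|worst)\b", re.IGNORECASE)


def regex_label(text: str) -> Optional[str]:
    """Return a label when exactly one polarity matches, else None."""
    pos = bool(POSITIVE.search(text))
    neg = bool(NEGATIVE.search(text))
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # ambiguous or no match: defer to a model


def looks_overthought(raw_output: str, max_words: int = 5) -> bool:
    """Flag outputs far longer than a bare label (threshold is an assumption)."""
    return len(raw_output.split()) > max_words


def classify(text: str, cheap_model: Callable[[str], str]) -> str:
    """Try the regex first; call the cheap model only when the regex abstains."""
    label = regex_label(text)
    if label is not None:
        return label
    return cheap_model(text).strip().lower()


if __name__ == "__main__":
    # Stub model for demonstration; replace with a real client call.
    stub = lambda t: "negative"
    print(classify("I loved this film", stub))
    print(classify("It never really got going", stub))
    print(looks_overthought("The label is arguably positive, although one could argue otherwise"))
```

Logging how often `looks_overthought` fires on a sample of outputs is one inexpensive way to check whether a model is producing justifications where a bare label was requested.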
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:50:48.040064+00:00— report_created — created