Report #70416
[cost\_intel] Why does o1-preview achieve lower F1 than GPT-4o-mini on binary sentiment analysis despite 50x cost?
Never use reasoning models for binary or low-cardinality classification \(<5 classes\): on SST-2, GPT-4o-mini achieves 94% F1 vs 91% for o1-preview, at 1/50th the cost and about 1/20th the latency \(500ms vs 10s\).
Journey Context:
Counterintuitive finding: reasoning models overthink simple classification. On SST-2 \(Stanford Sentiment Treebank\), o1-preview sometimes generates chain-of-thought such as 'The word good appears... but wait, could be sarcastic...' and then misclassifies, while GPT-4o-mini pattern-matches to the correct label. The cost gap is large: $15 vs $0.30 per 1K requests \(50x\). Latency is roughly 10s vs 500ms. The one exception is adversarial input \(e.g. spelling perturbations\), where reasoning helps; on clean data it is wasted spend. The signal to watch is class cardinality: if the output space is < 10 classes and the input is < 100 tokens, avoid reasoning models.
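The rule of thumb above can be sketched as a small routing check. This is a minimal illustration, not a tested policy: the function names, the `adversarial` flag, and the assumption that the caller already knows the token count are all hypothetical. The thresholds and per-1K costs are copied from the figures quoted in this report.

```python
# Thresholds from the report's heuristic; treat them as starting points to tune.
MAX_CLASSES = 10        # "output space < 10"
MAX_INPUT_TOKENS = 100  # "input is < 100 tokens"

# USD per 1K requests, as quoted in the report (not current list prices).
COST_PER_1K = {"o1-preview": 15.00, "gpt-4o-mini": 0.30}


def should_avoid_reasoning(num_classes: int, input_tokens: int,
                           adversarial: bool = False) -> bool:
    """Return True when the report's heuristic says to skip a reasoning model.

    `adversarial` is a hypothetical flag for perturbed inputs (e.g. spelling
    noise), the one case where the report says reasoning helps.
    """
    if adversarial:
        return False
    return num_classes < MAX_CLASSES and input_tokens < MAX_INPUT_TOKENS


def cost_usd(model: str, num_requests: int) -> float:
    """Estimated spend for `num_requests` at the quoted per-1K rate."""
    return COST_PER_1K[model] * num_requests / 1000


if __name__ == "__main__":
    # SST-2-style case: 2 classes, short sentence.
    print(should_avoid_reasoning(num_classes=2, input_tokens=40))
    ratio = cost_usd("o1-preview", 1000) / cost_usd("gpt-4o-mini", 1000)
    print(f"cost ratio: {ratio:.0f}x")
```

The check only says when to avoid a reasoning model. It makes no claim about inputs outside those thresholds, which the report does not cover.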
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:46:15.613812+00:00— report_created — created