Report #70898
[cost_intel] Cost-per-correct-answer analysis: reasoning models on math (GSM8K) vs text classification
On GSM8K, o1 reaches 98% accuracy versus 92% for GPT-4o, at 6x the cost. For simple classification (sentiment), however, GPT-4o-mini at $0.15/M tokens matches o1 at $15/M tokens on accuracy, so the reasoning model costs 100x more for zero gain.
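A minimal sketch of the arithmetic behind these figures, assuming cost per correct answer is simply the cost of one query divided by accuracy. The function name `cost_per_correct` is illustrative, the GSM8K costs are expressed in units of one GPT-4o query, and the shared sentiment accuracy of 0.95 is a placeholder (the report only says the two models are equal). Token counts per query are assumed identical across models, which may understate the cost of a reasoning model.

```python
def cost_per_correct(cost_per_query: float, accuracy: float) -> float:
    """Expected spend per correct answer: cost of one query divided by hit rate."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_query / accuracy


# GSM8K: costs in units of one GPT-4o query (o1 is 6x, per the report).
gsm8k_gpt4o = cost_per_correct(1.0, 0.92)
gsm8k_o1 = cost_per_correct(6.0, 0.98)

# Sentiment: prices per million tokens from the report.
# The shared accuracy (0.95) is a hypothetical placeholder.
sentiment_mini = cost_per_correct(0.15, 0.95)
sentiment_o1 = cost_per_correct(15.0, 0.95)

print("GSM8K o1 / GPT-4o:", round(gsm8k_o1 / gsm8k_gpt4o, 2))
print("Sentiment o1 / GPT-4o-mini:", round(sentiment_o1 / sentiment_mini, 2))
```

Under these assumptions the higher accuracy trims o1's 6x price gap on GSM8K to roughly 5.6x per correct answer, while the sentiment gap stays at the full 100x because equal accuracy cancels out of the ratio.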
Journey Context:
The cost-per-correct-answer curve is task-dependent. On GSM8K, the 6x cost increase is justified by the accuracy gain (92% to 98%) and the reduced need for retry loops. On binary classification tasks, reasoning models show no accuracy improvement over GPT-4o-mini while costing 100x more. The signature to watch for: if GPT-4o already scores >95% on a classification task, o1 will not improve on it but will add 10-60s of latency and 100x the cost. Use reasoning models only when baseline accuracy is <70% or the task requires multi-step deduction, as GSM8K does.
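The rule of thumb above can be sketched as a small routing check. The 0.70 threshold comes from the report; the function names and the `break_even_error_cost` helper are hypothetical additions. The helper assumes each wrong answer carries a fixed downstream cost (a retry, a manual review) and that the stronger model is also the more expensive one.

```python
import math

# Threshold taken from the report: "baseline accuracy is <70%".
REASONING_BASELINE_THRESHOLD = 0.70


def should_use_reasoning_model(baseline_accuracy: float,
                               needs_multistep_deduction: bool) -> bool:
    """Report's heuristic: low baseline accuracy OR multi-step deduction."""
    return (needs_multistep_deduction
            or baseline_accuracy < REASONING_BASELINE_THRESHOLD)


def break_even_error_cost(cost_weak: float, acc_weak: float,
                          cost_strong: float, acc_strong: float) -> float:
    """Cost of one wrong answer above which the stronger model is cheaper.

    Assumed model (not from the report):
        total cost per query = query cost + (1 - accuracy) * error cost
    """
    gain = acc_strong - acc_weak
    if gain <= 0:
        return math.inf  # no accuracy gain: extra spend never pays off
    return max(0.0, (cost_strong - cost_weak) / gain)


# Sentiment classification at a >95% baseline, no multi-step deduction.
print(should_use_reasoning_model(0.96, False))
# GSM8K: baseline is 92%, but the task needs multi-step deduction.
print(should_use_reasoning_model(0.92, True))
# GSM8K break-even, in units of one GPT-4o query.
print(round(break_even_error_cost(1.0, 0.92, 6.0, 0.98), 1))
```

With the report's GSM8K figures, this model puts the break-even at about 83 times the cost of a GPT-4o query per wrong answer. With equal accuracy, as in the sentiment case, the break-even is infinite, which matches the "zero gain" conclusion.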
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:35:09.364966+00:00— report_created — created