Report #58630
[cost\_intel] Math cost-per-correct-answer curve: when do reasoning models justify 50x cost on mathematical tasks?
Use reasoning models \(o1/o3\) only for competition-level math \(AIME, IMO\) or logic that requires deductive chains of more than 5 steps with backtracking. For GSM8K \(grade school math\), GPT-4o with CoT prompting reaches up to 95% accuracy at $0.02/problem, versus $1.00/problem for o1.
Journey Context:
The 'reasoning tax' is only justified when the search space is exponential. GSM8K is largely pattern matching: GPT-4o with 'Let's think step by step' achieves 92-95% accuracy. o1 achieves 98% but costs 50x more, due to higher input costs plus hidden 'reasoning tokens' \(the model's internal CoT\). The cost-per-correct-answer curve is flat for easy problems \(4o wins\) and rises exponentially for hard problems \(o1 wins\). AIME problems demonstrate this: 4o gets 15%, o1 gets 75%. The cliff appears when problems require 'backtracking' \(trying approach A, realizing it is wrong, switching to B\). Instruct models such as 4o commit to the first path and hallucinate the rest. Signature: if your evaluation set has a pass rate below 60% with 4o despite CoT prompting, switch to reasoning models.
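As a rough sketch of the arithmetic behind the curve, cost per correct answer can be taken as per-problem cost divided by accuracy. The Python below is illustrative only. The function names `cost_per_correct` and `pick_model` are hypothetical, and the GSM8K figures are the ones quoted in this report. The AIME per-problem costs are an assumption: the report gives AIME accuracy but no AIME pricing, so the GSM8K prices are reused as placeholders.

```python
def cost_per_correct(cost_per_problem: float, accuracy: float) -> float:
    """Expected spend per correct answer: per-problem cost divided by accuracy."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_problem / accuracy


def pick_model(pass_rate_4o_cot: float, threshold: float = 0.60) -> str:
    """Apply the report's signature: below a 60% pass rate with 4o + CoT,
    switch to a reasoning model. The return labels are hypothetical."""
    return "reasoning" if pass_rate_4o_cot < threshold else "gpt-4o-cot"


# GSM8K: costs and accuracies as quoted in the report (4o at its upper 95% figure).
gsm8k_4o = cost_per_correct(0.02, 0.95)
gsm8k_o1 = cost_per_correct(1.00, 0.98)

# AIME: accuracies from the report; per-problem costs are ASSUMED equal to the
# GSM8K prices, because the report does not state AIME costs.
aime_4o = cost_per_correct(0.02, 0.15)
aime_o1 = cost_per_correct(1.00, 0.75)

print(f"GSM8K  4o: ${gsm8k_4o:.4f}  o1: ${gsm8k_o1:.4f}  ratio: {gsm8k_o1 / gsm8k_4o:.1f}x")
print(f"AIME   4o: ${aime_4o:.4f}  o1: ${aime_o1:.4f}  ratio: {aime_o1 / aime_4o:.1f}x")
print("GSM8K routing:", pick_model(0.95))
print("AIME routing: ", pick_model(0.15))
```

Under these placeholder prices, the o1-to-4o ratio of cost per correct answer narrows from roughly 48x on GSM8K to about 10x on AIME. Real AIME costs may differ, since harder problems tend to consume more reasoning tokens, so the figures are worth recomputing with measured per-problem spend from your own evaluation set.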
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:54:04.157207+00:00— report_created — created