Report #85461
[cost\_intel] When do reasoning models justify a 10-50x cost premium over instruct models for mathematical tasks?
Use reasoning models \(o1/o3\) only when the task requires more than three steps of logical deduction or backtracking; for routine algebra, use GPT-4o or Claude 3.5 Sonnet. On AIME-level problems, expect 80%\+ accuracy from reasoning models versus about 40% from instruct models.
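The routing rule above could be sketched roughly as follows. This is only an illustration: `MathTask`, `pick_model_tier`, and the idea of having a step estimate and a backtracking flag available up front are assumptions made for the example, not part of the report. How those two inputs are estimated in practice is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MathTask:
    """Hypothetical task descriptor; both fields are assumed to be estimated upstream."""
    estimated_steps: int      # rough count of dependent deduction steps
    needs_backtracking: bool  # True if a failed approach must be abandoned and retried


def pick_model_tier(task: MathTask, step_threshold: int = 3) -> str:
    """Return "reasoning" for multi-step or backtracking tasks, else "instruct".

    The default threshold of 3 mirrors the ">3-step" rule of thumb in the report.
    """
    if task.needs_backtracking or task.estimated_steps > step_threshold:
        return "reasoning"
    return "instruct"


# Routine algebra: few steps, no backtracking, so stay on the cheaper tier.
routine = pick_model_tier(MathTask(estimated_steps=2, needs_backtracking=False))

# Competition-style problem: long deduction chain, so escalate.
competition = pick_model_tier(MathTask(estimated_steps=7, needs_backtracking=True))
```

Here `routine` evaluates to `"instruct"` and `competition` to `"reasoning"`. The threshold is kept as a parameter so it can be tuned against measured accuracy rather than treated as fixed.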
Journey Context:
Instruct models plateau around 30-40% on competition math \(AIME\) because they cannot backtrack systematically. Reasoning models reach 80-90% by using an internal chain of thought. However, they cost 10-50x more per token. Cost per correct answer may still favor reasoning models: their higher pass@1 reduces the need for self-consistency sampling, which multiplies the instruct model's cost by the number of samples drawn. Common mistake: using reasoning models for routine GSM8K problems, where instruct models already achieve >95%.
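The cost-per-correct-answer argument can be made concrete with a small calculation. The sketch below is illustrative only. It assumes a per-attempt cost ratio of 20x (inside the 10-50x range, and ignoring that reasoning models may also emit more tokens per attempt), takes the 40%, 85%, and 95% accuracies from the ranges quoted above, and invents a 55% accuracy for 16-sample self-consistency purely as a placeholder. The function name `cost_per_correct` is likewise hypothetical.

```python
def cost_per_correct(cost_per_attempt: float, accuracy: float, samples: int = 1) -> float:
    """Expected spend per correct answer: (samples * cost_per_attempt) / accuracy.

    `accuracy` is the accuracy of the whole procedure, i.e. after majority
    voting when samples > 1.
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    if samples < 1:
        raise ValueError("samples must be at least 1")
    return samples * cost_per_attempt / accuracy


INSTRUCT_COST = 1.0    # arbitrary cost unit per attempt
REASONING_COST = 20.0  # assumed 20x premium per attempt

# AIME-level problems
instruct_single = cost_per_correct(INSTRUCT_COST, 0.40)          # 2.50 units
instruct_sc16 = cost_per_correct(INSTRUCT_COST, 0.55, samples=16)  # about 29.09 units (0.55 is assumed)
reasoning_single = cost_per_correct(REASONING_COST, 0.85)        # about 23.53 units

# GSM8K-level problems: even a perfect reasoning model cannot win on cost.
gsm8k_instruct = cost_per_correct(INSTRUCT_COST, 0.95)    # about 1.05 units
gsm8k_reasoning = cost_per_correct(REASONING_COST, 1.00)  # 20.00 units

for name, value in [
    ("AIME instruct, 1 sample", instruct_single),
    ("AIME instruct, 16-sample self-consistency", instruct_sc16),
    ("AIME reasoning, 1 sample", reasoning_single),
    ("GSM8K instruct, 1 sample", gsm8k_instruct),
    ("GSM8K reasoning, 1 sample", gsm8k_reasoning),
]:
    print(f"{name}: {value:.2f} units per correct answer")
```

Under these assumed numbers, the reasoning model is cheaper per correct answer than heavy self-consistency sampling but more expensive than a single instruct attempt, so the premium pays off mainly when the accuracy itself is required. On GSM8K-style tasks the instruct model wins on cost by a wide margin.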
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:01:58.739849+00:00— report_created — created