Report #62598
[cost_intel] When do reasoning models justify 10-30x cost over instruct models for mathematical reasoning?
Use o1/o3 for competition-level math (AIME, USAMO) and for formal logic that requires deductions of more than 3 steps. Use GPT-4o with few-shot chain-of-thought (CoT) prompting for standard homework or calculus problems. Cost: o1 at ~$0.06/1k tokens vs. GPT-4o at $0.005/1k tokens (a 12x per-token difference), but o1 achieves 90%+ on AIME where GPT-4o plateaus at 50%.
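The routing rule above can be sketched as a small function. This is a minimal illustration, not a tested production router: the function name, the two model labels, and the idea that a caller already knows the problem tier and the deduction depth are all assumptions made for the example.

```python
# Hypothetical labels; substitute whatever model identifiers your client uses.
REASONING_MODEL = "reasoning"             # e.g. o1/o3 in the report
INSTRUCT_MODEL = "instruct-few-shot-cot"  # e.g. GPT-4o with few-shot CoT


def choose_model(competition_level: bool, deduction_steps: int = 1) -> str:
    """Pick a model tier using the thresholds stated in the report.

    competition_level: True for AIME/USAMO-style problems.
    deduction_steps:   estimated length of the longest deduction chain
                       (an assumed input; the report does not say how to
                       estimate it).
    """
    if deduction_steps < 0:
        raise ValueError("deduction_steps must be non-negative")
    # Reasoning model only for competition math or deductions of >3 steps.
    if competition_level or deduction_steps > 3:
        return REASONING_MODEL
    # Everything else (homework, calculus, 'solve for x') stays on instruct.
    return INSTRUCT_MODEL


if __name__ == "__main__":
    print(choose_model(competition_level=True))
    print(choose_model(competition_level=False, deduction_steps=2))
```

In practice the hard part is classifying the incoming problem; the sketch only encodes the decision once that classification exists.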
Journey Context:
The cost-per-correct-answer curve shows that reasoning models only win on math when difficulty exceeds the AMC 12 level. Below this level, instruct models with few-shot prompting achieve parity at 1/12th the cost. Common error: using o1 for 'solve for x' algebra, where GPT-4o is 100% accurate and instant. Quality signature: instruct models show 'confident wrong intermediate steps', while reasoning models show 'overthinking simple arithmetic' and spend excessive reasoning tokens on it.
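Cost per correct answer can be computed directly from a per-token price, a token count, and a measured accuracy. The sketch below is one way to do that, under assumptions the report does not state: a single attempt per problem, a flat per-1k-token price, and a token count per problem supplied by the caller. The function names are hypothetical.

```python
def cost_per_correct(price_per_1k: float, tokens: int, accuracy: float) -> float:
    """Expected spend per correct answer for single-attempt solving.

    price_per_1k: price in dollars per 1k tokens.
    tokens:       tokens billed per problem (assumed input; reasoning
                  models typically bill more tokens per problem).
    accuracy:     fraction of problems solved, in (0, 1].
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    if price_per_1k < 0 or tokens < 0:
        raise ValueError("price and tokens must be non-negative")
    return price_per_1k * tokens / 1000 / accuracy


def break_even_instruct_accuracy(
    reasoning_cost: float, instruct_cost: float, reasoning_accuracy: float
) -> float:
    """Instruct-model accuracy below which the reasoning model is cheaper
    per correct answer. Costs are per-problem inference costs in dollars."""
    if reasoning_cost <= 0:
        raise ValueError("reasoning_cost must be positive")
    return reasoning_accuracy * instruct_cost / reasoning_cost


if __name__ == "__main__":
    # Prices and AIME accuracies from the report; 1000 tokens per problem
    # for both models is an illustrative assumption only.
    print(cost_per_correct(0.06, 1000, 0.90))
    print(cost_per_correct(0.005, 1000, 0.50))
    print(break_even_instruct_accuracy(0.06, 0.005, 0.90))
```

The result is sensitive to the token counts and accuracies plugged in, so it is worth measuring both on your own problem set rather than relying on headline figures.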
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:33:20.536944+00:00— report_created — created