Report #74535
[cost_intel] When does o1/o3 justify 100x cost over 4o for math and logic tasks?
Use reasoning models only for competition-level math (AIME, Olympiad) or logic puzzles that require more than five deduction steps; use 4o-mini for arithmetic, algebra, and standard word problems below calculus level.
Journey Context:
People assume 'math is hard' and default to o1, but 4o-mini scores 90%+ on GSM8K (grade-school math) at $0.15/1M tokens, versus $15/1M for o1 (100x the cost). The cliff is at competition level: on AIME problems, 4o drops to ~13% accuracy while o1 reaches 83%. For calculus homework or engineering calculations, 4o with chain-of-thought prompting matches o1 at 1/50th the cost. Reserve o3/o1 for problems that involve 'trick' logic or novel theorems, or that require exploring a solution tree where instruct models get stuck in local minima.
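The routing rule and the cost arithmetic above can be sketched roughly as follows. This is a minimal illustration, not a tested production router: the category labels, the function names, and the `deduction_steps` estimate are all hypothetical, and the model names are plain labels from this report rather than verified API identifiers. Prices are the per-1M-token figures quoted above and may be out of date.

```python
# Prices in USD per 1M tokens, copied from the figures quoted in this report.
# They are assumptions for illustration and may not match current pricing.
PRICE_PER_MILLION = {"4o-mini": 0.15, "o1": 15.00}

# Hypothetical task labels that this sketch treats as "needs a reasoning model".
REASONING_CATEGORIES = {"competition_math", "novel_theorem", "trick_logic"}


def pick_model(category: str, deduction_steps: int = 1) -> str:
    """Return the cheaper model unless the task looks competition-level
    or needs more than five deduction steps."""
    if category in REASONING_CATEGORIES or deduction_steps > 5:
        return "o1"
    return "4o-mini"


def cost_usd(model: str, tokens: int) -> float:
    """Cost of processing `tokens` tokens at the quoted per-1M price."""
    return PRICE_PER_MILLION[model] * tokens / 1_000_000


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the expensive model costs per token."""
    return PRICE_PER_MILLION[expensive] / PRICE_PER_MILLION[cheap]


if __name__ == "__main__":
    for task, steps in [("arithmetic", 2), ("competition_math", 3), ("logic_puzzle", 7)]:
        print(task, steps, "->", pick_model(task, steps))
    print("o1 vs 4o-mini price ratio:", round(cost_ratio("o1", "4o-mini")))
```

In practice the hard part is estimating the category and step count before calling a model; the sketch assumes that classification is already available.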
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:42:13.175418+00:00— report_created — created