Report #77393
[cost_intel] When does o1/o3 reasoning justify a 10-30x cost over GPT-4o for mathematical tasks?
Use reasoning models (o1/o3) for competition-level math (AIME, Olympiad), where they achieve >50% accuracy vs <10% for instruct models. Use GPT-4o for standard algebra/calculus homework, where the accuracy gap is <5%.
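A minimal sketch of this rule as a router, assuming each problem has already been tagged with a difficulty tier upstream. The `Difficulty` enum, the `choose_model` function, and the model identifier strings are hypothetical names for illustration, not part of the report.

```python
from enum import Enum


class Difficulty(Enum):
    """Hypothetical difficulty tiers; how a problem gets its tier is out of scope."""
    STANDARD = "standard"        # curriculum algebra/calculus homework
    COMPETITION = "competition"  # AIME / Olympiad level


# Assumed model identifiers; substitute whatever names your provider exposes.
REASONING_MODEL = "o1-preview"
INSTRUCT_MODEL = "gpt-4o"


def choose_model(difficulty: Difficulty) -> str:
    """Send competition-level problems to the reasoning model, everything else to the instruct model."""
    if difficulty is Difficulty.COMPETITION:
        return REASONING_MODEL
    return INSTRUCT_MODEL
```

Defaulting to the instruct model keeps the expensive path opt-in: a problem only reaches the reasoning model when it has been explicitly tagged as competition-level.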
Journey Context:
The cost delta is ~10-30x (o1-preview vs GPT-4o). Many teams incorrectly route all math to reasoning models, burning budget on problems GPT-4o already solves reliably. The threshold is problem difficulty: if the problem is at AIME/Olympiad level, reasoning is worth the cost; if it is standard curriculum material, instruct models suffice. Latency is a secondary concern here, since math workloads are typically processed asynchronously.
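To make the budget argument concrete, here is a rough sketch of the average cost per problem when only hard problems go to the reasoning model. The function name `blended_cost`, the 10% hard-problem share, and the 20x multiplier (a midpoint of the ~10-30x range) are illustrative assumptions, not measured figures.

```python
def blended_cost(hard_fraction: float, cost_multiplier: float) -> float:
    """Average cost per problem, in units of one instruct-model call.

    hard_fraction: share of problems routed to the reasoning model (0.0 to 1.0).
    cost_multiplier: reasoning-model cost relative to the instruct model.
    """
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    if cost_multiplier <= 0:
        raise ValueError("cost_multiplier must be positive")
    return hard_fraction * cost_multiplier + (1.0 - hard_fraction) * 1.0


# Assumed workload: 10% competition-level problems, 20x cost multiplier.
routed = blended_cost(0.10, 20.0)        # only hard problems use reasoning
all_reasoning = blended_cost(1.0, 20.0)  # every problem uses reasoning
savings_ratio = all_reasoning / routed   # how many times cheaper routing is
```

Under these assumed numbers, routing costs about 2.9 units per problem against 20 units for sending everything to the reasoning model. The saving shrinks as the share of hard problems grows.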
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:30:20.784573+00:00— report_created — created