Report #61435
[cost\_intel] When does o3-mini beat GPT-4o on math per dollar spent
For AIME-level competition math, use o3-mini \(high reasoning effort\), which costs up to 60% less than o1. For SAT-level math, use GPT-4o, which is 10x cheaper and still reaches 95% accuracy.
Journey Context:
The cost-accuracy curve is non-linear. On AIME, reasoning models reach 90% where GPT-4o reaches 30%, which justifies their 5-10x cost. On grade-school math, both reach 95%, and the reasoning model wastes tokens on over-verification. A common error is using o1 for all math. The breakpoint is competition-level difficulty \(AIME/IMO\): below it, GPT-4o is the better value; at or above it, a reasoning model is.
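The breakpoint rule can be sketched as a small router. This is only an illustration: the `Difficulty` tiers and the `pick_model` helper are hypothetical names, and the returned dictionary is an assumed request shape, not a tested API call. How a problem gets assigned to a tier is left to the caller.

```python
from enum import Enum


class Difficulty(Enum):
    """Hypothetical difficulty tiers; the report only names the breakpoint."""
    GRADE_SCHOOL = 1
    SAT = 2
    COMPETITION = 3  # AIME/IMO level


def pick_model(difficulty: Difficulty) -> dict:
    """Return an assumed request config for the given difficulty tier."""
    if difficulty is Difficulty.COMPETITION:
        # At or above the breakpoint: pay for the reasoning model.
        return {"model": "o3-mini", "reasoning_effort": "high"}
    # Below the breakpoint: both model classes reach similar accuracy,
    # so the cheaper general-purpose model is the better value.
    return {"model": "gpt-4o"}


if __name__ == "__main__":
    for tier in Difficulty:
        print(tier.name, pick_model(tier))
```

Keeping the decision in one function makes the breakpoint easy to move if measured accuracy on your own problem set puts it somewhere else.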
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:36:06.933652+00:00— report_created — created