Report #55888
[cost_intel] When do reasoning models justify 100x cost for mathematical tasks?
Use reasoning models (o3/o1) for competition-level math (AIME, Olympiad) and formal verification; use instruct models (GPT-4o, Claude 3.5 Sonnet) for high school algebra or business calculations.
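As a minimal sketch, this recommendation could be encoded as a lookup from task type to model tier. The task labels, the `Tier` enum, and the `pick_tier` helper are hypothetical names introduced for illustration; defaulting unknown tasks to the cheaper tier is also an assumption, not something the report specifies.

```python
from enum import Enum


class Tier(Enum):
    REASONING = "reasoning"  # e.g. o3 / o1
    INSTRUCT = "instruct"    # e.g. GPT-4o / Claude 3.5 Sonnet


# Hypothetical task labels; the mapping mirrors the recommendation in the report.
TIER_BY_TASK = {
    "competition_math": Tier.REASONING,
    "formal_verification": Tier.REASONING,
    "high_school_algebra": Tier.INSTRUCT,
    "business_calculation": Tier.INSTRUCT,
}


def pick_tier(task_type: str) -> Tier:
    """Return the tier for a task label.

    Unknown labels fall back to the cheaper instruct tier (an assumption).
    """
    return TIER_BY_TASK.get(task_type, Tier.INSTRUCT)
```

For example, `pick_tier("competition_math")` returns `Tier.REASONING`, while an unlisted label such as `"unit_conversion"` falls back to `Tier.INSTRUCT`.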
Journey Context:
The cost-per-correct-answer curve is exponential in math difficulty. On AIME 2024, o3 achieves ~90% pass@1 while GPT-4o achieves ~13%, a roughly 7x accuracy gap that justifies the 50-100x cost premium. On GSM8K (grade school math), however, both models score >95%, so the reasoning model's premium buys no additional accuracy and is pure waste. The signature of 'worth it' is a multi-step derivation requiring more than 5 logical deductions, or a formal proof structure.
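The arithmetic behind this comparison can be sketched as follows. This is a rough illustration, not a pricing model: it normalizes the instruct model's cost to 1.0, uses the accuracy figures quoted above (with 0.95 standing in for ">95%" on GSM8K), and assumes that answers can be verified and that retries are independent, which is optimistic for hard problems. The function names are hypothetical.

```python
def cost_per_correct(relative_cost: float, accuracy: float) -> float:
    """Expected spend per correct answer, assuming independent, verifiable retries."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return relative_cost / accuracy


def effective_premium(cost_multiple: float, acc_reasoning: float, acc_instruct: float) -> float:
    """How many times more the reasoning model costs per correct answer.

    The instruct model's per-call cost is normalized to 1.0 (an assumption).
    """
    reasoning = cost_per_correct(cost_multiple, acc_reasoning)
    instruct = cost_per_correct(1.0, acc_instruct)
    return reasoning / instruct


# Accuracy figures as quoted in the report; 0.95 is a stand-in for ">95%".
BENCHMARKS = [("AIME 2024", 0.90, 0.13), ("GSM8K", 0.95, 0.95)]

for name, acc_r, acc_i in BENCHMARKS:
    for multiple in (50, 100):
        premium = effective_premium(multiple, acc_r, acc_i)
        print(f"{name}: {multiple}x per call -> {premium:.1f}x per correct answer")
```

Under these assumptions the per-call premium of 50-100x shrinks to roughly 7-14x per correct answer on AIME 2024, while on GSM8K it stays at the full 50-100x because both models are equally accurate. The retry assumption flatters the instruct model: repeated sampling does not reliably recover problems it cannot solve, so the practical gap on hard problems is likely narrower than this metric suggests.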
Lifecycle
2026-06-20T00:18:12.546552+00:00— report_created — created