Report #56587
[cost_intel] When to pay 30x for reasoning models on competition math vs. wasting money on simple arithmetic
Use o1/o3-class models only for AIME/AMC 12+ level problems or PhD-level physics; use GPT-4o-mini for arithmetic, algebra I, and standard calculus. The cost gap is 20-50x, and on hard problems the accuracy gap is a cliff: instruct models score near 0% while reasoning models reach 80%+.
Journey Context:
Teams often assume 'harder math = reasoning model' universally, but reasoning models are specifically tuned for olympiad-style search spaces with verification. For standard textbook problems, instruct models already achieve >95% accuracy at as little as 1/50th the cost. The quality degradation signature is subtle: on medium-difficulty AMC 10 problems (not AMC 12), instruct models drop to ~60% while reasoning models stay >90%, creating a 'middle cliff' where the upgrade is essential. A common anti-pattern is using reasoning models for 'show your work' tutoring steps where the underlying math is trivial, which burns budget on token-heavy chain-of-thought that isn't needed.
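The routing rule described above could be sketched roughly as follows. This is a minimal illustration, not a tested production router: the `Difficulty` tiers, the `pick_model` and `blended_cost` helpers, and the default 30x cost ratio are all hypothetical names and values chosen for the example. The model identifiers are placeholders for "an instruct model" and "a reasoning model". The default threshold sits at AMC 10 because that is where the report places the 'middle cliff'. Raise it to `Difficulty.AMC_12` to follow the stricter summary rule.

```python
from enum import IntEnum
from typing import Mapping


class Difficulty(IntEnum):
    # Hypothetical tiers, ordered from easiest to hardest.
    ARITHMETIC = 1
    ALGEBRA_1 = 2
    STANDARD_CALCULUS = 3
    AMC_10 = 4
    AMC_12 = 5
    AIME = 6
    PHD_PHYSICS = 7


# Placeholder identifiers; substitute whatever your provider exposes.
INSTRUCT_MODEL = "gpt-4o-mini"
REASONING_MODEL = "o1"


def pick_model(difficulty: Difficulty,
               threshold: Difficulty = Difficulty.AMC_10) -> str:
    """Route by the difficulty of the underlying math only.

    Whether the user asked to 'show your work' is deliberately not an
    input: trivial math stays on the instruct model regardless.
    """
    return REASONING_MODEL if difficulty >= threshold else INSTRUCT_MODEL


def blended_cost(mix: Mapping[Difficulty, int],
                 cost_ratio: float = 30.0,
                 threshold: Difficulty = Difficulty.AMC_10) -> float:
    """Relative cost of a workload, in units of one instruct-model call.

    cost_ratio is an assumed reasoning/instruct price multiple
    (the report cites a 20-50x range).
    """
    total = 0.0
    for difficulty, count in mix.items():
        is_reasoning = pick_model(difficulty, threshold) == REASONING_MODEL
        total += count * (cost_ratio if is_reasoning else 1.0)
    return total


if __name__ == "__main__":
    workload = {Difficulty.ARITHMETIC: 90, Difficulty.AIME: 10}
    routed = blended_cost(workload)
    # Sending everything to the reasoning model is the same as routing
    # with the threshold at the lowest tier.
    everything_reasoning = blended_cost(workload,
                                        threshold=Difficulty.ARITHMETIC)
    print(f"routed: {routed}, all-reasoning: {everything_reasoning}")
```

Under the assumed 30x ratio, a workload of 90 arithmetic problems and 10 AIME problems costs 390 units when routed and 3000 units when everything goes to the reasoning model. In practice the difficulty label has to come from somewhere (course metadata, a cheap classifier, or manual tagging), and that step is outside this sketch.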
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:28:31.313246+00:00— report_created — created