Report #92259
[cost_intel] Assuming higher model cost equals better accuracy on all math problems
Use GPT-4o for GSM8K (grade school math) and reserve o1 for AIME/olympiad problems; GPT-4o reaches 95% on GSM8K at about 1/20th the cost
Journey Context:
GSM8K (grade school math) is saturated: GPT-4o scores 94-95%, while o1 scores 97-98% but costs about 20x more per correct answer. The cost-per-correct-answer curve is flat on easy problems and rises steeply only at olympiad level. On AIME (American Invitational Mathematics Examination), GPT-4o scores 12% while o1 scores 50%+, so here the reasoning model is worth the cost. Common error: using o1 for all math 'just to be safe.' Signature of waste: paying $0.50 per problem when $0.02 achieves nearly the same accuracy.
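The trade-off above can be sketched as a small routing rule: pick the cheapest model whose accuracy is within a tolerance of the best available. This is a minimal sketch, not a measured benchmark. The per-problem dollar costs are the report's illustrative $0.02 and $0.50, the accuracies are taken from the ranges quoted above (97.5% is an assumed midpoint, 50% an assumed lower bound), and `ModelStats`, `cost_per_correct`, `pick_model`, and the 5-point tolerance are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    name: str
    cost_per_problem: float  # USD per attempted problem (illustrative figure)
    accuracy: float          # fraction of problems answered correctly


def cost_per_correct(m: ModelStats) -> float:
    """Expected spend to obtain one correct answer."""
    return m.cost_per_problem / m.accuracy


def pick_model(candidates: list[ModelStats], max_accuracy_gap: float = 0.05) -> ModelStats:
    """Cheapest model whose accuracy is within max_accuracy_gap of the best one."""
    best = max(m.accuracy for m in candidates)
    good_enough = [m for m in candidates if best - m.accuracy <= max_accuracy_gap]
    return min(good_enough, key=lambda m: m.cost_per_problem)


# Assumed inputs: accuracies from the ranges in this report, costs illustrative.
GSM8K = [ModelStats("gpt-4o", 0.02, 0.95), ModelStats("o1", 0.50, 0.975)]
AIME = [ModelStats("gpt-4o", 0.02, 0.12), ModelStats("o1", 0.50, 0.50)]

for benchmark, models in (("GSM8K", GSM8K), ("AIME", AIME)):
    choice = pick_model(models)
    print(f"{benchmark}: route to {choice.name}")
    for m in models:
        print(f"  {m.name}: ${cost_per_correct(m):.3f} per correct answer")
```

With these inputs the rule routes GSM8K to gpt-4o and AIME to o1. Note that on AIME the cheaper model still shows a lower cost per correct answer, but its accuracy is far below the best available, which is why the sketch filters on accuracy before comparing cost.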
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:26:50.523743+00:00— report_created — created