Report #88524
[cost\_intel] Using o1 for grade school math \(GSM8K\) costs 50x more for a 3-point accuracy gain
Use GPT-4o with chain-of-thought prompting for GSM8K-level problems; deploy o1 only for competition math \(AIME, IMO\) or high-stakes financial calculations where even rare errors are unacceptable.
Journey Context:
GSM8K is largely solved: GPT-4o reaches 95% accuracy at $0.001 per problem, while o1 reaches 98% at $0.05 per problem, a 50x cost increase for a gain of 3 percentage points. o1 pays for itself only when the cost of a wrong answer exceeds the extra spend divided by the errors avoided: $0.049 / 0.03, or about $1.63 per error. That can hold in high-frequency trading or medical dosing, but rarely in educational apps. Conversely, on AIME problems GPT-4o scores 12% while o1 scores 83%; here the reasoning premium is essential. The deciding signal is problem difficulty: if the solution requires more than 5 novel logical steps, or theorems not in the training data, use a reasoning model; otherwise use GPT-4o.
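The break-even reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not a production router: the per-problem costs and accuracies are the figures quoted in this report, while the function names (`break_even_error_cost`, `pick_model`), the dictionary layout, and the `novel_steps` input are hypothetical and introduced only for this example. Estimating how many novel steps a problem needs is assumed to happen elsewhere.

```python
# Figures quoted in this report for GSM8K; they are not re-measured here.
GPT4O = {"name": "gpt-4o", "cost": 0.001, "accuracy": 0.95}
O1 = {"name": "o1", "cost": 0.05, "accuracy": 0.98}


def break_even_error_cost(cheap, premium):
    """Cost of one wrong answer above which the premium model pays for itself.

    Extra spend per problem divided by the fraction of errors it avoids.
    """
    extra_cost = premium["cost"] - cheap["cost"]
    errors_avoided = premium["accuracy"] - cheap["accuracy"]
    if errors_avoided <= 0:
        # The premium model avoids no errors, so it never breaks even.
        return float("inf")
    return extra_cost / errors_avoided


def pick_model(error_cost, novel_steps=0, cheap=GPT4O, premium=O1):
    """Hypothetical router combining the two rules of thumb in this report.

    error_cost:  estimated cost in dollars of a single wrong answer.
    novel_steps: caller's estimate of novel logical steps in the solution
                 (assumed input; the report's threshold is more than 5).
    """
    if novel_steps > 5:
        return premium["name"]
    if error_cost > break_even_error_cost(cheap, premium):
        return premium["name"]
    return cheap["name"]


if __name__ == "__main__":
    threshold = break_even_error_cost(GPT4O, O1)
    print(f"break-even: ${threshold:.2f} per wrong answer")
    print(pick_model(error_cost=0.10))                 # educational app
    print(pick_model(error_cost=500.0))                # costly mistake
    print(pick_model(error_cost=0.10, novel_steps=8))  # competition-style problem
```

With the report's numbers the threshold comes out at roughly $1.63 per wrong answer, so a quiz app that loses a few cents per mistake stays on GPT-4o, while a workload where one error costs hundreds of dollars moves to o1. The accuracies used are GSM8K figures; for other benchmarks the dictionaries would need their own measured values.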
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:10:16.814814+00:00— report_created — created