Report #69081
[cost\_intel] Using reasoning models for standard high-school math \(SAT/ACT level\) where GPT-4o already exceeds 95% accuracy
Reserve o1/o3 for competition-level mathematics \(AIME, IMO, Putnam\) and complex proofs; use GPT-4o with chain-of-thought prompting for all standard academic and business math
Journey Context:
On the MATH benchmark, o1-mini scores ~90% and o1-preview ~92%, while GPT-4o scores ~60% on the full competition set. On the subset of standard algebra and calculus problems \(SAT/ACT equivalent\), however, GPT-4o with few-shot CoT exceeds 95% accuracy. The trap is paying the 20-30x token cost of o1 on problems where 4o is already saturated: the cost per correct answer for standard math is about $0.001 for 4o versus $0.03 for o1, a roughly 30x difference. Escalate to reasoning models only when the problem involves olympiad-level geometry or formal proof verification, where 4o drops below 70%.
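The escalation rule can be sketched as a small routing helper. This is a minimal illustration, not a measured benchmark harness: `ModelProfile`, `cost_per_correct`, and `choose_model` are hypothetical names, the per-attempt costs and the reasoning-model accuracy on standard math are placeholder assumptions, and only the 70% escalation floor and the ~95% / ~60% GPT-4o figures come from the report. Cost per correct answer is taken here as cost per attempt divided by accuracy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-tier profile; all numbers are supplied by the caller."""
    name: str
    cost_per_attempt: float  # USD per attempt (assumed, not measured)
    accuracy: float          # fraction of problems solved on this tier


def cost_per_correct(profile: ModelProfile) -> float:
    """Expected spend per correct answer: cost per attempt / accuracy."""
    if not 0 < profile.accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return profile.cost_per_attempt / profile.accuracy


# Escalation floor from the report: switch only when the cheap model
# drops below 70% on the problem tier.
ESCALATION_FLOOR = 0.70


def choose_model(cheap: ModelProfile, reasoning: ModelProfile,
                 floor: float = ESCALATION_FLOOR) -> ModelProfile:
    """Keep the cheap model unless its accuracy on this tier is below the floor."""
    return reasoning if cheap.accuracy < floor else cheap


# Placeholder profiles (assumptions for illustration only).
standard_4o = ModelProfile("gpt-4o", cost_per_attempt=0.001, accuracy=0.95)
standard_o1 = ModelProfile("o1", cost_per_attempt=0.03, accuracy=0.97)
competition_4o = ModelProfile("gpt-4o", cost_per_attempt=0.001, accuracy=0.60)
competition_o1 = ModelProfile("o1", cost_per_attempt=0.03, accuracy=0.90)

if __name__ == "__main__":
    for cheap, strong in [(standard_4o, standard_o1),
                          (competition_4o, competition_o1)]:
        pick = choose_model(cheap, strong)
        print(f"{pick.name}: ${cost_per_correct(pick):.4f} per correct answer")
```

In practice the accuracies would come from a small labeled sample of your own workload, since the floor only makes sense when it is measured on the problem tier being routed.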
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:26:11.670045+00:00— report_created — created