Report #78157
[cost\_intel] When does o3 beat GPT-4o on competition math by >50%, and when is it a wasted 10x cost on simple algebra?
Deploy o3/o1 for AIME-level problems \(>90th percentile difficulty\) and formal proofs; use GPT-4o for AMC 10/12 and standard calculus.
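A minimal sketch of this routing rule in Python. The tier names, the `pick_model` helper, and the model label strings are hypothetical and used only for illustration; they are not verified API model identifiers. Classifying a problem into a tier is assumed to happen upstream.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical difficulty tiers taken from the recommendation above."""
    AMC = "amc_10_12"
    CALCULUS = "standard_calculus"
    AIME = "aime_level"          # roughly >90th percentile difficulty
    FORMAL_PROOF = "formal_proof"


# Tiers where the report recommends paying for a reasoning model.
REASONING_TIERS = {Tier.AIME, Tier.FORMAL_PROOF}


def pick_model(tier: Tier) -> str:
    """Return an illustrative model label for a given difficulty tier."""
    # "o3" and "gpt-4o" are placeholder labels, not confirmed API names.
    return "o3" if tier in REASONING_TIERS else "gpt-4o"


if __name__ == "__main__":
    for tier in Tier:
        print(tier.value, "->", pick_model(tier))
```

Keeping the rule as a lookup on a tier set makes it easy to move a tier across the boundary if your own measurements disagree with the report.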
Journey Context:
On AIME 2024, o1-preview achieves 56.7% accuracy while GPT-4o reaches only 12.3%, a gap of about 44 percentage points. The gap widens sharply as problem difficulty increases. For straightforward symbolic manipulation, however, both models exceed 98%, so the 10-30x cost premium for o1 buys almost no extra accuracy. GPT-4o's accuracy falls off a cliff when problems require multi-step constructive proofs rather than pattern matching.
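One way to make the tradeoff concrete is cost per correct answer: cost per attempt divided by accuracy. The sketch below uses the accuracies quoted above, with 0.98 standing in for ">98%". The function names are hypothetical, the baseline cost is normalized to 1, and the model assumes each attempt is priced the same and that retrying until correct is acceptable, which is a simplification.

```python
def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer at a given accuracy."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


def premium_per_correct(cost_multiplier: float,
                        acc_reasoning: float,
                        acc_baseline: float) -> float:
    """Ratio of reasoning-model to baseline cost per correct answer.

    The baseline cost per attempt is normalized to 1.0 (an assumption).
    """
    reasoning = cost_per_correct(cost_multiplier, acc_reasoning)
    baseline = cost_per_correct(1.0, acc_baseline)
    return reasoning / baseline


if __name__ == "__main__":
    for multiplier in (10, 30):
        hard = premium_per_correct(multiplier, 0.567, 0.123)  # AIME 2024 figures
        easy = premium_per_correct(multiplier, 0.98, 0.98)    # simple algebra
        print(f"{multiplier}x price: hard {hard:.1f}x, easy {easy:.1f}x per correct answer")
```

Under these assumptions the effective premium on AIME-level problems shrinks to roughly 2.2x at a 10x price and 6.5x at a 30x price, while on simple algebra the full 10-30x premium is paid with no accuracy gain.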
Lifecycle
2026-06-21T13:46:52.306188+00:00— report_created — created