Report #63824
[cost_intel] When does o3-mini beat GPT-4o on math despite 10x cost?
Use reasoning models for competition-level math (AIME, IMO), geometry that requires spatial reasoning, and multi-step algebraic proofs. Accept the 10-20x cost increase only when the task requires accuracy above 95% and token volume is under 4k. Avoid them for standard calculus homework, where GPT-4o already scores above 90%.
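The rule above can be sketched as a small routing check. This is only an illustration: the category labels, the `MathTask` type, and the `recommend_model` function are hypothetical names, and the thresholds are taken directly from the recommendation rather than from any measured tuning.

```python
from dataclasses import dataclass

# Hypothetical labels for the task types the report says justify a reasoning model.
REASONING_CATEGORIES = {"competition_math", "spatial_geometry", "multi_step_proof"}


@dataclass
class MathTask:
    category: str             # e.g. "competition_math" or "standard_calculus"
    required_accuracy: float  # fraction in [0, 1]
    token_volume: int         # expected tokens for the request


def recommend_model(task: MathTask) -> str:
    """Return a model label following the report's stated thresholds."""
    needs_reasoning = task.category in REASONING_CATEGORIES
    strict_accuracy = task.required_accuracy > 0.95  # ">95% is required"
    small_volume = task.token_volume < 4000          # "token volume is <4k"
    if needs_reasoning and strict_accuracy and small_volume:
        return "o3-mini"
    return "gpt-4o"


print(recommend_model(MathTask("competition_math", 0.97, 2000)))
print(recommend_model(MathTask("standard_calculus", 0.97, 2000)))
```

All three conditions must hold before the more expensive model is chosen, so a routine task or a relaxed accuracy target falls back to GPT-4o.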
Journey Context:
GPT-4o hits a reasoning wall at around 60% on AIME problems because it cannot backtrack. o3-mini reaches >80% because it can explore solution trees and recognize dead ends. For single-step derivatives or standard integrals, however, both models score 95%+ and the cost difference (0.4 cents vs 4 cents per query) isn't justified. The inflection point is problem depth: the reasoning model pays off on tasks that need more than 3 reasoning steps and backtracking. The quality degradation signature in GPT-4o is 'premature commitment': it locks into an incorrect approach in the first three tokens and cannot recover.
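One way to make the trade-off concrete is to compute the extra spend per additional correct answer. The sketch below uses the per-query costs quoted above. The accuracies are illustrative point values (0.60, 0.80, 0.95) standing in for the report's "around 60%", ">80%", and "95%+", so treat them as assumptions. The helper name `cost_per_extra_correct` is hypothetical.

```python
from typing import Optional


def cost_per_extra_correct(cheap_cost: float, pricey_cost: float,
                           cheap_acc: float, pricey_acc: float) -> Optional[float]:
    """Extra cents spent per additional correct answer, or None if there is no gain."""
    gain = pricey_acc - cheap_acc
    if gain <= 0:
        return None  # the pricier model adds no accuracy, so the spend buys nothing
    return (pricey_cost - cheap_cost) / gain


GPT4O_CENTS = 0.4    # per query, from the report
O3_MINI_CENTS = 4.0  # per query, from the report

# AIME-style problems: assumed 60% vs 80% accuracy.
aime = cost_per_extra_correct(GPT4O_CENTS, O3_MINI_CENTS, 0.60, 0.80)

# Routine derivatives and integrals: both assumed at 95%.
routine = cost_per_extra_correct(GPT4O_CENTS, O3_MINI_CENTS, 0.95, 0.95)

print(aime, routine)
```

Under these assumed numbers, each additional correct AIME-style answer costs roughly 18 cents, while on routine problems there is no accuracy gain to pay for.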
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:36:49.830746+00:00— report_created — created