Report #84118
[cost_intel] When do o3/o1 beat instruct models by >20% on math and formal proofs?
Use reasoning models (o1/o3) for competition math (AIME), formal verification, and complex theorem proving. On AIME 2024, GPT-4o achieves ~13% accuracy while o1 reaches ~83%. The per-token price is roughly 6-24x higher ($15-60 vs $2.50 per 1M tokens), but this is justified when the downstream value of a correct proof exceeds $100 or when GPT-4o requires >6 attempts to get one correct answer.
Journey Context:
Teams often try few-shot prompting with GPT-4o, but it plateaus because competition math requires search and backtracking, not just pattern matching. The cost comparison must be per-correct-answer, not per-token. Taking the per-1M-token price as the cost of one attempt, GPT-4o at 13% accuracy needs ~7.7 attempts per correct answer, for an effective cost of ~$19. o1 at $60 and 83% accuracy comes to ~$72. That makes o1 still roughly 3.8x more expensive per correct answer, but when task value is high (security proofs), the reliability premium is worth it. The failure mode of instruct models is confident hallucination of proof steps.
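The per-correct-answer arithmetic above can be sketched as follows. This is a minimal illustration, not a benchmark harness: the function names are hypothetical, attempts are assumed to be independent, a correct answer is assumed to be recognisable when it appears (for example via a proof checker), and the per-1M-token price is used as the cost of one attempt, as in the figures quoted in this report.

```python
def expected_attempts(accuracy: float) -> float:
    """Mean number of independent attempts until the first correct answer.

    Assumes each attempt succeeds with the same probability (geometric model).
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return 1.0 / accuracy


def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer."""
    return cost_per_attempt * expected_attempts(accuracy)


# Figures taken from this report; cost per attempt is an assumption
# (one attempt is priced at the per-1M-token rate).
gpt4o = cost_per_correct(2.50, 0.13)
o1 = cost_per_correct(60.00, 0.83)

print(f"GPT-4o: {expected_attempts(0.13):.1f} attempts, ${gpt4o:.2f} per correct answer")
print(f"o1:     {expected_attempts(0.83):.1f} attempts, ${o1:.2f} per correct answer")
print(f"o1 / GPT-4o cost ratio: {o1 / gpt4o:.1f}x")
```

Substituting your own measured accuracy and per-attempt token spend gives a more realistic comparison, since reasoning models typically consume more output tokens per attempt than the flat rate assumed here.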
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:46:57.999340+00:00— report_created — created