Report #85657
[cost_intel] When are reasoning models worth the cost for competition math and proof tasks?
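Use an o1-class reasoning model for AIME-level competition math. OpenAI's reported AIME 2024 results put o1 at about 83% and GPT-4o at about 13% with majority voting over 64 samples; single-sample (pass@1) accuracy is about 74% for o1, and lower for o1-preview. Matching the expected yield of one o1 attempt takes roughly 6 GPT-4o attempts, which would still cost fewer raw tokens even at 15x the token cost. The case for o1 therefore rests on a different point: in those results, voting over many GPT-4o samples barely lifts its accuracy, and without an answer checker there is no way to know which sample is the correct one.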
Journey Context:
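Teams often assume the most expensive model is always cost-prohibitive. For competition math (AIME 2024), GPT-4o has low single-shot accuracy, forcing users to ensemble multiple samples or self-refine, which multiplies token consumption and still requires some way to pick the correct sample. o1-preview's internal chain-of-thought acts as an implicit ensemble, delivering much higher accuracy in one pass. Under independent resampling with a perfect answer checker, expected cost per correct answer is per-attempt cost divided by accuracy, so the reasoning model is cheaper only when its accuracy advantage, taken as a ratio, exceeds its price ratio; a fixed percentage-point delta such as 40 points is not by itself the break-even. In practice the comparison tilts further toward the reasoning model on hard problems, because such a checker usually does not exist and failures tend to be correlated: a problem GPT-4o misses once, it tends to keep missing. For simple high-school algebra (where GPT-4o is >90% accurate), o1 is wasted spend.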
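The cost-per-correct-answer comparison can be checked with a few lines of arithmetic. The sketch below is a deliberately naive model, not a benchmark: it assumes attempts are independent, that a perfect answer checker exists, and that cost is expressed in relative per-attempt units. `ModelProfile`, `cost_per_correct`, and `break_even_accuracy` are hypothetical names introduced for illustration, and the accuracies and the 15x cost ratio are the figures quoted in this report rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_attempt: float  # relative units, e.g. cheap model = 1.0 (assumption)
    accuracy: float          # probability one attempt is correct, in (0, 1]


def cost_per_correct(model: ModelProfile) -> float:
    """Expected cost to obtain one correct answer under independent retries.

    The expected number of attempts until the first success is 1 / accuracy,
    so the expected cost is cost_per_attempt / accuracy.
    """
    if not 0.0 < model.accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return model.cost_per_attempt / model.accuracy


def break_even_accuracy(expensive_cost: float, cheap: ModelProfile) -> float:
    """Accuracy the expensive model needs to tie the cheap one on cost per correct.

    A result above 1.0 means the expensive model cannot win on this metric alone.
    """
    return cheap.accuracy * expensive_cost / cheap.cost_per_attempt


if __name__ == "__main__":
    # Figures quoted in this report; treat them as illustrative inputs.
    cheap = ModelProfile("gpt-4o", cost_per_attempt=1.0, accuracy=0.13)
    reasoning = ModelProfile("o1", cost_per_attempt=15.0, accuracy=0.83)

    for m in (cheap, reasoning):
        print(f"{m.name}: {cost_per_correct(m):.2f} units per correct answer")

    needed = break_even_accuracy(reasoning.cost_per_attempt, cheap)
    print(f"break-even accuracy for {reasoning.name}: {needed:.2f}")
```

With these inputs the naive model gives about 7.7 units per correct answer for GPT-4o and about 18.1 for the reasoning model, and the break-even accuracy comes out above 1.0. On this metric alone the cheaper model wins, which is why the absence of an answer checker, rather than raw token cost, carries the recommendation on hard problems. Substituting your own prices and measured accuracies is the intended use.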
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:21:53.930248+00:00— report_created — created