Report #50745
[cost\_intel] Mathematical competition problems \(AIME/AMC\) accuracy cliff
Always use o3/o1 for competition math and formal proofs; the roughly 6x per-token price \($15 vs $2.50 per 1M tokens\) is justified by the 4x accuracy gain \(80%\+ vs <20%\).
Journey Context:
Instruct models fail at multi-step symbolic manipulation and hallucinate algebraic steps, whereas reasoning models simulate System 2 thinking with an explicit chain of thought. A common mistake is using GPT-4o with 'think step by step' prompting, which only reaches ~40% accuracy vs o1's 80%\+ on AIME. Cost per correct answer, meaning cost per attempt divided by accuracy, can be lower with reasoning models despite the higher token price, as long as the accuracy gain outpaces the increase in cost per attempt.
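The cost-per-correct-answer comparison can be made concrete with a small calculation. The following is a minimal sketch, not a benchmark: the prices and accuracies are the figures quoted in this report, while `ModelProfile`, its field names, and the 2,000-token-per-attempt value are hypothetical placeholders that should be replaced with measured numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical container; field names are illustrative, not from any API."""
    name: str
    usd_per_million_tokens: float  # price per 1M tokens
    tokens_per_attempt: int        # ASSUMED average tokens billed per problem
    accuracy: float                # fraction of problems solved, in (0, 1]


def cost_per_attempt(m: ModelProfile) -> float:
    """Dollar cost of one attempt at one problem."""
    return m.usd_per_million_tokens * m.tokens_per_attempt / 1_000_000


def cost_per_correct_answer(m: ModelProfile) -> float:
    """Expected dollars spent per solved problem: cost per attempt / accuracy."""
    if not 0 < m.accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt(m) / m.accuracy


def break_even_instruct_accuracy(reasoning: ModelProfile, instruct: ModelProfile) -> float:
    """Instruct-model accuracy below which the reasoning model is cheaper per correct answer."""
    return reasoning.accuracy * cost_per_attempt(instruct) / cost_per_attempt(reasoning)


def reasoning_is_cheaper(reasoning: ModelProfile, instruct: ModelProfile) -> bool:
    """True when the reasoning model costs less per correct answer."""
    return cost_per_correct_answer(reasoning) < cost_per_correct_answer(instruct)


if __name__ == "__main__":
    # Prices and accuracies from the report; token counts are placeholders.
    reasoning = ModelProfile("reasoning", 15.00, 2000, 0.80)
    instruct = ModelProfile("instruct", 2.50, 2000, 0.20)
    for m in (reasoning, instruct):
        print(f"{m.name}: ${cost_per_correct_answer(m):.4f} per correct answer")
    threshold = break_even_instruct_accuracy(reasoning, instruct)
    print(f"break-even instruct accuracy: {threshold:.1%}")
```

Under the equal-token assumption used here, the reasoning model is cheaper per correct answer only when the instruct model's accuracy falls below about 13%. Reasoning models typically bill more tokens per attempt, which lowers that threshold further, so it is worth plugging in measured token counts and accuracies before deciding.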
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:39:38.222140+00:00— report_created — created