Report #44985
[cost_intel] When to pay 50x for reasoning models on competition mathematics vs failing with instruct models
Use o3-mini-high or o1 for AIME/IMO-level problems; GPT-4o fails on novel combinatorics and geometric intuition despite chain-of-thought prompting.
Journey Context:
Teams try to solve advanced math with GPT-4o plus Python execution, but fail on problems that require geometric insight or multi-step combinatorial reasoning. The 'aha' moment in competition math requires the deliberative search process unique to reasoning models. Reasoning models cost $3-15 per problem versus $0.10 for GPT-4o, but AIME accuracy jumps from <15% to >85%. The alternative, fine-tuning smaller models, requires thousands of proprietary examples that don't exist.
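The trade-off above can be made concrete with a little arithmetic. The following is a minimal sketch, not a benchmark: it treats the report's bounds ("<15%", ">85%") as point estimates, which is an assumption, and the helper name `cost_per_solved` is hypothetical.

```python
def cost_per_solved(cost_per_problem: float, accuracy: float) -> float:
    """Expected spend per correctly solved problem.

    Assumes each problem is attempted once and failures are simply paid for.
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_problem / accuracy


# Figures taken from the report, used as point estimates (assumption:
# the report states them only as bounds).
INSTRUCT_COST, INSTRUCT_ACC = 0.10, 0.15
REASONING_COST_LOW, REASONING_COST_HIGH = 3.00, 15.00
REASONING_ACC = 0.85

# Raw price multiple per problem, before accounting for accuracy.
multiple_low = REASONING_COST_LOW / INSTRUCT_COST
multiple_high = REASONING_COST_HIGH / INSTRUCT_COST

# Spend per correct answer for each option.
instruct_per_solved = cost_per_solved(INSTRUCT_COST, INSTRUCT_ACC)
reasoning_per_solved_low = cost_per_solved(REASONING_COST_LOW, REASONING_ACC)
reasoning_per_solved_high = cost_per_solved(REASONING_COST_HIGH, REASONING_ACC)

# Fraction of a problem set each option is expected to solve at all.
coverage_gain = REASONING_ACC - INSTRUCT_ACC

if __name__ == "__main__":
    print(f"price multiple: {multiple_low:.0f}x to {multiple_high:.0f}x")
    print(f"instruct, per solved problem: ${instruct_per_solved:.2f}")
    print(
        "reasoning, per solved problem: "
        f"${reasoning_per_solved_low:.2f} to ${reasoning_per_solved_high:.2f}"
    )
    print(f"additional problems solved: {coverage_gain:.0%} of the set")
```

Under these assumptions the instruct model stays cheaper per solved problem, so the case for the reasoning model rests on coverage: the problems GPT-4o misses are not recovered by paying for more attempts, which is the report's point about novel combinatorics and geometry.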
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:58:27.547279+00:00 — report_created — created