Report #81747
[cost_intel] When to pay for reasoning models on competition math vs. using a code interpreter
Use o1/o3 for AIME/IMO-level problems (>$0.01 per solution); use GPT-4o with Python for algebra and arithmetic (<$0.001 per solution). GPT-4o fails on 70%+ of AIME problems even with chain-of-thought prompting, while o1's test-time compute scales to hard proofs.
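The routing rule above can be sketched as a small lookup. This is only an illustration: the tier labels, the `Route` type, and `choose_route` are hypothetical names, and the model strings are placeholders rather than verified API identifiers.

```python
from dataclasses import dataclass

# Hypothetical tier labels. The report only separates competition-level
# problems (AIME/IMO) from routine algebra and arithmetic.
HARD_TIERS = {"aime", "imo"}
EASY_TIERS = {"algebra", "arithmetic"}


@dataclass(frozen=True)
class Route:
    model: str                  # placeholder model name, not a checked API id
    use_code_interpreter: bool  # whether to attach a Python tool


def choose_route(tier: str) -> Route:
    """Pick a model for a problem tier, following the report's rule of thumb."""
    key = tier.strip().lower()
    if key in HARD_TIERS:
        # Competition-level reasoning: pay for the reasoning model.
        return Route(model="o1", use_code_interpreter=False)
    if key in EASY_TIERS:
        # Routine computation: cheaper model plus a Python tool.
        return Route(model="gpt-4o", use_code_interpreter=True)
    # The report gives no guidance for other tiers, so refuse to guess.
    raise ValueError(f"unknown tier: {tier!r}")
```

How a problem gets its tier label is left open here; the report does not describe a classifier, so that step would have to be supplied separately.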
Journey Context:
Teams try GPT-4o with CoT + code interpreter on math olympiad problems and hit a reasoning ceiling at AIME Problem 5. o1's test-time compute allocation succeeds where instruct models fail. For simple algebraic manipulation, however, o1 costs about 100x more for a 2% accuracy gain.
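One way to make the "100x cost for 2% accuracy gain" trade-off concrete is to compute the extra spend per additional correct answer. The sketch below is illustrative only: the function name is hypothetical, the 2% is read as two percentage points, and the baseline accuracy and per-solution prices in the usage lines are assumed values, not figures from this report.

```python
def cost_per_extra_correct(
    cheap_cost: float,
    cheap_acc: float,
    pricey_cost: float,
    pricey_acc: float,
) -> float:
    """Extra dollars spent per additional correct answer when upgrading models.

    Costs are per solution; accuracies are fractions in [0, 1].
    Returns infinity when the pricier model is not more accurate.
    """
    gain = pricey_acc - cheap_acc
    if gain <= 0:
        return float("inf")
    return (pricey_cost - cheap_cost) / gain


# Assumed numbers for simple algebra: $0.001 vs. 100x that, 96% vs. 98%.
extra = cost_per_extra_correct(
    cheap_cost=0.001, cheap_acc=0.96,
    pricey_cost=0.1, pricey_acc=0.98,
)
print(f"${extra:.2f} per additional correct answer")
```

Under those assumed inputs the upgrade costs about $4.95 for each extra correct answer, which is the kind of figure to weigh against how much a single correct solution is worth in a given pipeline.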
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:48:18.951377+00:00— report_created — created