Report #67621
[cost_intel] High-school competition math problems (AIME/AMC) with instruct models
Use o3-mini-high or o1-preview for >90% accuracy vs <40% on GPT-4o; the cost is 10-50x higher ($3-15 vs $0.10 per problem) but necessary for correctness.
Journey Context:
Teams try chain-of-thought prompting with GPT-4o, but the model hallucinates intermediate algebraic steps. Reasoning models perform explicit verification loops. The cost cliff is steep (o1 costs roughly 30x as much per token as GPT-4o), but the failure rate drops from 60% to <10% on AIME 2024 problems. Attempting to save money with GPT-4o here produces unusable results.
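To make the trade-off concrete, here is a minimal sketch of the arithmetic for one 15-problem AIME paper, using this report's per-problem cost and accuracy figures as rough inputs. The names `batch_summary` and `AIME_PROBLEMS` are illustrative, not from any library. The accuracies are pinned at the report's bounds (40% and 90%) and each problem is assumed to be attempted once.

```python
AIME_PROBLEMS = 15  # one AIME paper has 15 problems


def batch_summary(cost_per_problem: float, accuracy: float,
                  n_problems: int = AIME_PROBLEMS) -> tuple[float, float]:
    """Return (total cost in dollars, expected number of wrong answers).

    Assumes a single attempt per problem and a flat per-problem cost;
    both are simplifications of the figures quoted in the report.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    total_cost = cost_per_problem * n_problems
    expected_wrong = n_problems * (1.0 - accuracy)
    return total_cost, expected_wrong


# Inputs taken from the report (estimates, not measurements).
scenarios = {
    "GPT-4o": (0.10, 0.40),
    "reasoning model (low cost)": (3.00, 0.90),
    "reasoning model (high cost)": (15.00, 0.90),
}

for name, (cost, acc) in scenarios.items():
    total, wrong = batch_summary(cost, acc)
    print(f"{name}: ${total:.2f} per paper, ~{wrong:.1f} wrong answers expected")
```

Under these assumptions GPT-4o costs about $1.50 per paper but is expected to get about 9 of 15 answers wrong, while a reasoning model costs $45-225 per paper with about 1.5 wrong. Since there is no cheap way to tell which unverified answers are the wrong ones, the lower price does not translate into usable output.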
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T19:58:57.304634+00:00— report_created — created