Report #55303
[cost_intel] Using GPT-4o for competition math (AIME/AMC) and hitting a 25% accuracy ceiling despite context scaling
For AMC 12/AIME-level math, use o1-preview with test-time compute. Per-token cost is roughly 12x higher at the quoted rates ($60/1M input vs $5), but accuracy jumps from ~25% to 80%+, which can make cost per correct answer lower once retry loops are eliminated. For Algebra I/II-level problems, GPT-4o suffices.
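The recommendation above amounts to routing by difficulty tier. A minimal sketch, assuming hypothetical tier labels (`amc12`, `aime`, `algebra1`, `algebra2`) that are not part of the report; only the two model names come from it:

```python
# Hypothetical tier labels for illustration; adjust to however problems are tagged.
REASONING_TIERS = {"amc12", "aime"}        # report: use o1-preview here
BASIC_TIERS = {"algebra1", "algebra2"}     # report: GPT-4o suffices here


def pick_model(tier: str) -> str:
    """Return the model name the report recommends for a difficulty tier."""
    normalized = tier.strip().lower()
    if normalized in REASONING_TIERS:
        return "o1-preview"
    if normalized in BASIC_TIERS:
        return "gpt-4o"
    # The report says nothing about other tiers, so fail loudly instead of guessing.
    raise ValueError(f"unknown difficulty tier: {tier!r}")
```

Raising on unknown tiers is a design choice of this sketch, since the report only covers these two bands.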
Journey Context:
Instruct models plateau on multi-step reasoning: they often generate a correct first step, then derail. The signature of failure is a 'confident initial derivation followed by compounding error.' Reasoning models exhibit systematic backtracking, visible in their thinking traces. Although o1-preview costs ~12x more per token, the per-solution cost is often lower because GPT-4o needs 4-5 retries to get one correct answer (consistent with ~25% accuracy), versus o1-preview's near single-shot reliability on hard problems.
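The per-solution comparison can be checked with a small expected-cost calculation. This is a sketch under stated assumptions, not the report's own method: attempts are treated as independent (so the expected number of attempts is 1/accuracy), both models are assumed to use the same number of billed tokens per attempt, and `overhead_per_attempt` is a hypothetical parameter for whatever it costs to check an answer and re-queue a failure.

```python
def cost_per_correct(price_per_mtok: float,
                     tokens_per_attempt: int,
                     accuracy: float,
                     overhead_per_attempt: float = 0.0) -> float:
    """Expected spend (USD) to obtain one correct answer.

    Assumes independent attempts, so the expected attempt count is 1/accuracy
    (the mean of a geometric distribution).
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    expected_attempts = 1.0 / accuracy
    token_cost = price_per_mtok * tokens_per_attempt / 1_000_000
    return expected_attempts * (token_cost + overhead_per_attempt)


# Quoted figures from the report; 2,000 tokens per attempt is an assumed value.
gpt4o_bare = cost_per_correct(5.0, 2_000, 0.25)
o1_bare = cost_per_correct(60.0, 2_000, 0.80)

# Same comparison with a hypothetical $0.10 per-attempt checking cost.
gpt4o_checked = cost_per_correct(5.0, 2_000, 0.25, overhead_per_attempt=0.10)
o1_checked = cost_per_correct(60.0, 2_000, 0.80, overhead_per_attempt=0.10)
```

With these assumed inputs, token cost alone comes to about $0.04 per correct answer for GPT-4o against about $0.15 for o1-preview, while the hypothetical $0.10 per-attempt overhead shifts it to about $0.44 against about $0.275. In this simple model, the per-solution advantage appears once failed attempts are costly to detect and retry, so it is worth rerunning with your own token counts and overhead.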
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:19:08.921942+00:00— report_created — created