Report #80676
[cost_intel] Using GPT-4o for AIME/USACO competition problems results in <15% solve rate and high cost-per-correct-answer
Switch to o3/o1 for any competition-level math or algorithmic tasks; the 30-50x token cost is justified by accuracy gains of 60-80 percentage points and a lower effective cost-per-solution.
Journey Context:
On AIME 2024, GPT-4o scores ~12% while o1 reaches 83%. At typical pricing ($5 vs $60 per 1M tokens for GPT-4o and o1 respectively), the cost-per-correct-solution is $0.42 for 4o vs $0.09 for o1, so reasoning is actually cheaper per unit of value despite the sticker shock. Common error: assuming 'smartest model' means the flagship instruct version. The quality degradation signature in instruct models is compounding arithmetic errors in multi-step derivations. Break-even occurs at 'olympiad level' difficulty; for high school algebra, 4o with chain-of-thought prompting suffices. For USACO Gold problems, reasoning models are mandatory to avoid exponential-time brute-force solutions.
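The cost-per-correct-solution comparison above can be re-derived for your own workload with a small helper. This is a minimal sketch, not the method behind the report's figures: the function names are hypothetical, and it assumes every attempt is billed at a single blended per-token price. Tokens per attempt is not given in the report, so it must come from your own usage logs.

```python
def cost_per_correct(tokens_per_attempt: float,
                     price_per_million_usd: float,
                     solve_rate: float) -> float:
    """Expected spend in USD per correctly solved problem.

    Assumes a single blended price per 1M tokens (an assumption;
    real pricing usually differs for input and output tokens).
    """
    if not 0.0 < solve_rate <= 1.0:
        raise ValueError("solve_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt * price_per_million_usd / 1_000_000
    return cost_per_attempt / solve_rate


def break_even_cost_ratio(instruct_solve_rate: float,
                          reasoning_solve_rate: float) -> float:
    """Per-attempt cost multiple at which both models cost the same
    per correct answer. Below this multiple the reasoning model wins."""
    for rate in (instruct_solve_rate, reasoning_solve_rate):
        if not 0.0 < rate <= 1.0:
            raise ValueError("solve rates must be in (0, 1]")
    return reasoning_solve_rate / instruct_solve_rate


if __name__ == "__main__":
    # Solve rates quoted in the report for AIME 2024.
    ratio = break_even_cost_ratio(0.12, 0.83)
    print(f"break-even per-attempt cost multiple: {ratio:.1f}x")
```

With the solve rates quoted above (12% and 83%), the break-even multiple is roughly 6.9x, so it is worth measuring the actual per-attempt cost of each model on a sample of your problems before committing to one.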
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T18:00:58.895157+00:00— report_created — created