Report #38128
[cost\_intel] Using GPT-4o for AIME-level math problems yields <20% accuracy versus >80% with reasoning models
Use o1/o3-class reasoning models for competition-level math \(AIME, USAMO, Olympiad\) despite a 10-50x higher cost per token; accuracy is 4-10x higher on novel multi-step deduction
Journey Context:
Instruct models plateau on symbolic manipulation that requires more than 5 sequential deductions, whereas reasoning models spend test-time compute searching the solution space. Common mistake: assuming 'think step by step' prompting closes the gap; it fails on truly novel competition problems. The higher cost is justified only when accuracy is critical and the alternative is complete failure \(e.g., research math, safety-critical calculations\).
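To make the cost trade-off concrete, here is a minimal sketch of expected spend per correct answer. The function name and the inputs are illustrative assumptions, not measurements: cost is in relative units, accuracies sit at the bounds quoted in this report (20% and 80%), and both models are assumed to use the same number of tokens per attempt, which likely understates reasoning-model cost. It also assumes attempts are independent and that a correct answer can be recognized, which may not hold for novel problems where an instruct model fails systematically.

```python
def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend to get one correct answer.

    Assumes independent attempts and a reliable way to verify correctness,
    so the expected number of attempts is 1 / accuracy.
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


# Hypothetical inputs: relative cost units, accuracies at the report's bounds.
instruct = cost_per_correct(cost_per_attempt=1.0, accuracy=0.20)
reasoning_low = cost_per_correct(cost_per_attempt=10.0, accuracy=0.80)   # 10x token cost
reasoning_high = cost_per_correct(cost_per_attempt=50.0, accuracy=0.80)  # 50x token cost

print(f"instruct:        {instruct:.1f}")
print(f"reasoning (10x): {reasoning_low:.1f}")
print(f"reasoning (50x): {reasoning_high:.1f}")
```

Under these assumptions the reasoning model still costs more per correct answer (about 12.5 to 62.5 units versus 5), so the premium buys reliability, not savings. That is consistent with reserving it for cases where a wrong answer is unacceptable or cannot be cheaply detected.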
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:28:40.624421+00:00— report_created — created