Report #63047
[cost\_intel] Competition-level math or formal logic proofs requiring >80% accuracy
Use o1/o3 reasoning models despite the 30-50x per-token cost; the cost per correct answer is lower because their success rate on AIME/IMO problems is more than 6x higher.
Journey Context:
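GPT-4o achieves <13% on AIME 2024 while o1 scores 83%, more than a 6x gap in pass rate. At the quoted prices of $15/1M tokens \(o1\) vs $0.30/1M \(4o\), o1 costs 50x more per token. The reported cost to get one correct answer is $18 \(o1\) vs $230 \(4o\); these figures also depend on how many tokens each attempt consumes and how many retries are needed, neither of which is stated here, so they cannot be reproduced from the per-token price and pass rate alone. Chain-of-thought prompting with 4o raises pass@1 to only 25%, still far below o1. This is the canonical case where reasoning compute matters more than model size.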
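The cost-per-correct-answer comparison can be sketched as a small calculation. This is a minimal sketch, not the method used to produce the figures above: it assumes independent attempts, a checker that can recognize a correct answer, and a single blended per-token price. The function names and the `tokens_per_attempt` parameter are hypothetical; the report does not give token counts, so they have to be measured per workload.

```python
def cost_per_correct(price_per_million_usd: float,
                     tokens_per_attempt: float,
                     pass_rate: float) -> float:
    """Expected USD spent to obtain one correct answer.

    Assumes independent attempts, so the expected number of
    attempts until the first success is 1 / pass_rate.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    if price_per_million_usd < 0 or tokens_per_attempt < 0:
        raise ValueError("price and token count must be non-negative")
    cost_per_attempt = price_per_million_usd * tokens_per_attempt / 1_000_000
    return cost_per_attempt / pass_rate


def break_even_token_ratio(price_a: float, pass_a: float,
                           price_b: float, pass_b: float) -> float:
    """Largest ratio tokens_a / tokens_b at which model A still matches
    model B on cost per correct answer.

    Derived from: price_a * tokens_a / pass_a <= price_b * tokens_b / pass_b
    """
    if min(price_a, price_b) <= 0 or min(pass_a, pass_b) <= 0:
        raise ValueError("prices and pass rates must be positive")
    return (pass_a / pass_b) * (price_b / price_a)
```

To apply it, call `cost_per_correct` once per model with that model's quoted price, its measured average tokens per attempt, and its pass rate, then compare the two results. `break_even_token_ratio` shows how much extra token usage the more accurate model can afford before its advantage disappears.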
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:18:20.177337+00:00— report_created — created