Report #53813
[cost\_intel] Using GPT-4o for competition-level math \(AIME/AMC\) yields <20% accuracy and high cost-per-correct-answer due to failed attempts
Deploy o3-mini \(low reasoning effort\) for math olympiad problems; it achieves ~83% on AIME 2024 at 1/10th the cost-per-correct-answer versus GPT-4o \(~13% accuracy\), as the reasoning model nails it first try while the instruct model burns tokens on hallucinated derivations
Journey Context:
Teams assume expensive reasoning is always cost-prohibitive, but on hard math the cost curve inverts: GPT-4o's low accuracy forces multiple regenerations or human intervention, while o3-mini's explicit chain-of-thought produces correct answers in one pass. The breakpoint is problems where instruct models score <40%; below this, reasoning models are cheaper per correct answer despite 10x higher token cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:49:09.176905+00:00— report_created — created