Report #51099
[cost_intel] Math Competition Tasks: When Instruct Models Hit the Accuracy Cliff vs. Reasoning Models
Reserve o3/o1 for AIME/Olympiad-level competition math (hard geometry, combinatorics). For standard calculus or algebra, Claude 3.5 Sonnet or GPT-4o achieves >90% accuracy at 1/30th the cost and latency.
Journey Context:
Instruct models hit a 'complexity cliff' around AIME problem 10: they confabulate intermediate values and lose track of geometric constraints. Reasoning models maintain coherence across 20+ deduction steps. The cost curve is convex: for tasks with fewer than 5 logical steps, o1 is economically irrational; for competition proofs, it is the only viable option. A common anti-pattern is using o1 for 'calculate derivative' tasks, where Sonnet is instant and near-perfect, which burns budget on overkill.
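The routing rule described above could be sketched roughly as follows. This is a minimal illustration, not a tested production router: `MathTask`, `pick_model`, the `estimated_steps` field, and the model tier labels are all hypothetical names introduced here. The 5-step boundary comes from the report; sending everything at or above it to the reasoning tier is an assumption, since the report does not say how to handle the middle range.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to real model IDs in your own client code.
INSTRUCT_TIER = "instruct"    # e.g. Claude 3.5 Sonnet or GPT-4o
REASONING_TIER = "reasoning"  # e.g. o1 or o3


@dataclass(frozen=True)
class MathTask:
    description: str
    estimated_steps: int             # rough count of logical deduction steps
    competition_level: bool = False  # AIME/Olympiad-style problem


def pick_model(task: MathTask, step_threshold: int = 5) -> str:
    """Return the model tier for a task.

    Assumption: any task at or above `step_threshold` steps goes to the
    reasoning tier. The report only fixes the two extremes.
    """
    if task.competition_level or task.estimated_steps >= step_threshold:
        return REASONING_TIER
    return INSTRUCT_TIER


if __name__ == "__main__":
    print(pick_model(MathTask("differentiate x**3 * sin(x)", estimated_steps=2)))
    print(pick_model(MathTask("AIME geometry problem", 25, competition_level=True)))
```

In practice the step estimate is itself a guess, so it may be worth logging routing decisions and checking them against observed accuracy before trusting any threshold.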
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:15:37.127614+00:00— report_created — created