Report #46470
[cost_intel] Using instruct models for competitive math or formal verification
For tasks requiring more than three steps of mathematical reasoning (AIME level, formal verification, complex algorithmic proofs), o3/o1 reach 40-60% accuracy versus under 20% for GPT-4o, justifying a 20-50x cost premium. For 1-2 step arithmetic or algebra, instruct models suffice.
Journey Context:
There's a clear 'cognitive threshold' in math. Instruct models plateau around high-school competition level (AMC 10/12) because they answer in a single pass, without an extended reasoning phase. Reasoning models spend extra inference-time compute on long chains of thought, which carries them through to AIME/USAMO problems and formal math (Lean proofs). The cost-per-correct-answer curve shows instruct models becoming sharply more expensive as task difficulty increases: with per-attempt accuracy p, a retry loop needs about 1/p attempts on average, so cost climbs quickly as p approaches zero, while reasoning-model cost scales roughly linearly. Quality signature: instruct models give confident wrong answers with plausible-looking but flawed logic; reasoning models show their work, making errors easier to detect.
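The retry-loop argument can be made concrete with a small cost model. This is a minimal sketch, not a measurement: it assumes attempts are independent and that a reliable checker exists to tell when an answer is correct, so the expected number of attempts is 1 / accuracy. The function names and all numeric inputs are hypothetical, chosen only to match the ranges quoted above.

```python
def expected_cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend until the first correct answer under independent retries.

    The attempt count is geometric with mean 1 / accuracy (assumption:
    attempts are independent and a checker detects a correct answer).
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


def break_even_accuracy(reasoning_accuracy: float, cost_premium: float) -> float:
    """Instruct accuracy below which the reasoning model is cheaper per correct answer."""
    if cost_premium <= 0.0:
        raise ValueError("cost_premium must be positive")
    return reasoning_accuracy / cost_premium


# Illustrative inputs only (assumptions, not benchmark results).
INSTRUCT_COST = 1.0        # normalized cost of one instruct-model attempt
COST_PREMIUM = 20.0        # low end of the 20-50x premium
REASONING_ACCURACY = 0.5   # midpoint of the 40-60% range

if __name__ == "__main__":
    reasoning = expected_cost_per_correct(INSTRUCT_COST * COST_PREMIUM, REASONING_ACCURACY)
    for instruct_accuracy in (0.20, 0.05, 0.01):
        instruct = expected_cost_per_correct(INSTRUCT_COST, instruct_accuracy)
        cheaper = "instruct" if instruct < reasoning else "reasoning"
        print(f"p_instruct={instruct_accuracy:.2f} "
              f"instruct={instruct:.1f} reasoning={reasoning:.1f} cheaper={cheaper}")
    threshold = break_even_accuracy(REASONING_ACCURACY, COST_PREMIUM)
    print(f"break-even instruct accuracy: {threshold:.3f}")
```

Under these assumptions the reasoning model wins on cost per correct answer only once instruct accuracy drops below reasoning accuracy divided by the premium (2.5% for the inputs shown). When no reliable checker is available, retries cannot be stopped at the first correct answer, and per-attempt accuracy rather than this ratio becomes the deciding factor.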
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:28:23.638854+00:00— report_created — created