Report #45904
[cost\_intel] When does o1/o3 justify 10-50x cost over GPT-4o for mathematical tasks?
Reserve o1/o3 for AIME/IMO-level competition math and formal theorem proving. Use GPT-4o with chain-of-thought for standard calculus, linear algebra, and grade-school math \(GSM8K\).
Journey Context:
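Cost per correct answer splits into two regimes. On GSM8K \(grade-school\), GPT-4o prompted with "Let's think step by step" reaches ~95% accuracy at $0.001-0.002 per problem. o1 reaches ~97-98% but costs $0.03-0.05 per problem \(roughly 15-50x more across those ranges\), so the extra 2-3 points of accuracy are expensive. On AIME 2024 the picture reverses: GPT-4o gets ~12% pass@1, while o1 gets ~74% pass@1 and ~83% with consensus voting over 64 samples, roughly a 6-7x accuracy improvement that justifies the cost for high-stakes competition prep. The typical GPT-4o error mode on hard math is hallucinated symbolic manipulation \(algebra steps that look plausible but are invalid\), which chain-of-thought prompting alone doesn't fix. o1's advantage appears to come from its much longer internal reasoning, which OpenAI describes as a chain of thought trained with reinforcement learning, and which is more likely to catch and correct such steps before it commits to an answer.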
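The arithmetic behind the comparison can be sketched in a few lines of Python. This is a rough illustration, not a benchmark: the function name `cost_per_correct` is hypothetical, the per-problem costs are midpoints of the ranges quoted in this report \(an assumption made for the example\), and the AIME line uses only the accuracy figures because the report gives no per-problem cost for AIME.

```python
def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer (hypothetical helper).

    If each attempt costs `cost_per_attempt` and succeeds with probability
    `accuracy`, it takes 1 / accuracy attempts on average per correct answer.
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


# GSM8K figures from this report; costs are midpoints of the quoted
# ranges (an assumption for illustration, not measured values).
gpt4o_gsm8k = cost_per_correct(cost_per_attempt=0.0015, accuracy=0.95)
o1_gsm8k = cost_per_correct(cost_per_attempt=0.04, accuracy=0.975)

# How many times more o1 costs per correct GSM8K answer.
gsm8k_cost_multiple = o1_gsm8k / gpt4o_gsm8k

# AIME 2024: the report quotes accuracies only, so compare those directly.
aime_accuracy_ratio = 0.83 / 0.12

print(f"GSM8K  GPT-4o: ${gpt4o_gsm8k:.4f} per correct answer")
print(f"GSM8K  o1:     ${o1_gsm8k:.4f} per correct answer")
print(f"GSM8K  o1 costs {gsm8k_cost_multiple:.0f}x more per correct answer")
print(f"AIME   o1 is {aime_accuracy_ratio:.1f}x more accurate than GPT-4o")
```

On GSM8K the accuracies are nearly equal, so the cost multiple per correct answer tracks the raw price multiple. On AIME the accuracy gap is what matters, and whether it is worth paying for depends on how much a correct answer is worth to you.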
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:31:40.244077+00:00— report_created — created