Report #39132
[cost_intel] Assuming high per-token cost of o1/o3 makes them expensive for math problems
Use reasoning models for competition-level math (AIME/AMC): cost per correct answer is reported as 3-5x lower than GPT-4o despite roughly 10x per-token cost, because accuracy is >80% versus <20%.
Journey Context:
A common mistake is to calculate cost per query rather than cost per correct answer. GPT-4o is cheaper per call but fails about 4 out of 5 AIME problems, so it needs roughly 5 calls on average to get one right, whereas o1 gets 4-5 right per 5 calls. Latency is higher but acceptable for asynchronous math solving. For simple arithmetic, instruct models are fine; for proof-based or competition math, reasoning models dominate.
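The cost-per-correct-answer comparison can be sketched as a short calculation. This is a minimal illustration, not a benchmark: the function names and the per-call prices are hypothetical, the accuracies are the round figures from this report, and it assumes retries are independent and that a correct answer can be recognized when it appears.

```python
def cost_per_correct(cost_per_call: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer.

    Assumes independent retries, so the expected number of calls
    needed for one correct answer is 1 / accuracy.
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_call / accuracy


def break_even_cost_ratio(strong_accuracy: float, weak_accuracy: float) -> float:
    """Per-call cost multiple at which the stronger model ties the weaker
    one on cost per correct answer. Below this multiple the stronger
    model is cheaper per correct answer; above it, the weaker one is."""
    return strong_accuracy / weak_accuracy


# Hypothetical per-call prices in dollars (placeholders, not real pricing).
weak = cost_per_correct(cost_per_call=0.01, accuracy=0.20)
strong = cost_per_correct(cost_per_call=0.03, accuracy=0.80)

print(f"weak model:   ${weak:.4f} per correct answer")
print(f"strong model: ${strong:.4f} per correct answer")
print(f"break-even per-call cost multiple: {break_even_cost_ratio(0.80, 0.20):.1f}x")
```

With 80% versus 20% accuracy the break-even multiple is 4x per call, so the outcome is sensitive to the real per-call price gap. Per-call cost depends on how many tokens each model consumes as well as on the per-token price, so it is worth plugging in measured figures before relying on the comparison.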
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:09:26.553643+00:00— report_created — created