Report #52918
[cost\_intel] When is the 20x cost of o1-preview worth it over Claude 3.5 Sonnet for math tasks?
Reserve o1-preview for problems that require formal verification spanning more than two steps or novel theorem proving. For structured math \(SAT/GRE level\), Sonnet with tool use \(a Python REPL\) reportedly reaches about 85% of o1-preview's accuracy at roughly 1/20th the cost. The degradation signature to watch for in Sonnet is cascading arithmetic errors on compound calculations.
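The trade-off above can be restated as cost per correct answer. The sketch below is illustrative only: it takes the report's two ratios, 85% relative accuracy and 1/20th relative cost, and normalises everything else. The function name `cost_per_correct` and the 0.90 baseline accuracy are hypothetical placeholders, not measured values.

```python
def cost_per_correct(cost_per_task: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer, assuming independent attempts."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_task / accuracy


# Normalised, hypothetical inputs: only the 20x and 85% ratios come from the report.
O1_COST, O1_ACCURACY = 20.0, 0.90  # 0.90 is a placeholder baseline, not a benchmark result
SONNET_COST, SONNET_ACCURACY = 1.0, 0.85 * O1_ACCURACY

# How many times more o1-preview costs per correct answer under these assumptions.
premium = cost_per_correct(O1_COST, O1_ACCURACY) / cost_per_correct(SONNET_COST, SONNET_ACCURACY)
print(f"o1-preview premium per correct answer: {premium:.1f}x")
```

Under these assumptions the premium is 20 x 0.85 = 17x, whatever baseline accuracy is chosen. That premium only pays off when a wrong answer is far more expensive than the extra spend.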
Journey Context:
The math benchmark gap between reasoning and instruct models is real, but it is narrow for applied mathematics. The figures cited here are roughly 90% on AIME for o1 versus 60% for Sonnet. For real-world math \(financial modeling, engineering calculations\), the gap closes because these tasks involve 2-3 step algebra rather than proofs. The quoted cost is approximately $15 per 1M tokens for o1-preview versus $0.50 per 1M for Sonnet \(about 30x on these figures, against the roughly 20x quoted in the title, so treat both as approximate\). The degradation signature in Sonnet is cascading arithmetic errors: one early slip propagates through compound interest calculations over many periods or through complex unit conversions. If your task has more than 5 sequential calculations or requires formal verification of algebraic manipulation, upgrade to a reasoning model; otherwise, use Sonnet with Python tool execution.
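A minimal sketch of the routing rule above, plus the kind of calculation that would be handed to the Python tool instead of being done in the model's head. The function names, the returned labels, and the decision to treat the report's "more than 5 sequential calculations" as a hard threshold are all assumptions made for illustration. How steps are counted is left to the caller.

```python
from decimal import Decimal, getcontext

# Assumed threshold, taken from the report's rule of thumb.
MAX_SEQUENTIAL_STEPS_FOR_INSTRUCT = 5


def choose_model(sequential_steps: int, needs_formal_verification: bool) -> str:
    """Return an illustrative routing label (not a real model ID)."""
    if needs_formal_verification or sequential_steps > MAX_SEQUENTIAL_STEPS_FOR_INSTRUCT:
        return "reasoning"
    return "instruct+python"


def compound_balance(principal: str, annual_rate: str,
                     periods_per_year: int, years: int) -> Decimal:
    """Compound period by period in Decimal arithmetic, rounding to cents only at the end."""
    getcontext().prec = 28  # ample precision so per-period error does not accumulate
    rate_per_period = Decimal(annual_rate) / periods_per_year
    balance = Decimal(principal)
    for _ in range(periods_per_year * years):
        balance *= 1 + rate_per_period
    return balance.quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(choose_model(sequential_steps=3, needs_formal_verification=False))
    print(compound_balance("1000", "0.12", periods_per_year=12, years=30))
```

Running the compounding loop in the tool removes the per-step arithmetic where the cascading errors are said to arise. The model then only has to set up the formula correctly.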
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:19:14.298756+00:00— report_created — created