Report #93305
[cost_intel] Math word problem accuracy per dollar: o3-mini vs GPT-4o with CoT prompting.
For GSM8K-style math problems, o3-mini achieves 95%+ accuracy at ~$0.003 per problem, while GPT-4o with Chain-of-Thought (CoT) prompting reaches 92% at ~$0.005 per problem and also fails catastrophically on multiplication involving numbers of 3 or more digits. Use o3-mini for multi-step arithmetic; use GPT-4o only for single-step word problems with simple arithmetic.
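The comparison above can be reduced to one number, the expected spend per correct answer (cost per problem divided by accuracy). The following is a minimal sketch using the approximate figures from this report; the function name `cost_per_correct` is hypothetical, and 95% is used as a lower bound for o3-mini.

```python
def cost_per_correct(cost_per_problem: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_problem / accuracy


# Approximate figures taken from this report; not independently measured.
o3_mini = cost_per_correct(0.003, 0.95)     # 95% treated as a lower bound
gpt_4o_cot = cost_per_correct(0.005, 0.92)

print(f"o3-mini:      ${o3_mini:.5f} per correct answer")
print(f"GPT-4o + CoT: ${gpt_4o_cot:.5f} per correct answer")
```

With these inputs the result is roughly $0.0032 versus $0.0054 per correct answer, so the gap is slightly wider than the raw per-problem prices suggest.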
Journey Context:
Teams often try to squeeze more math performance out of GPT-4o with elaborate CoT prompts, but this hits a 'reasoning ceiling' on compositional operations (e.g., 'calculate the area then divide by the price per sq ft'). The cost-per-correct-answer curves diverge sharply once a problem needs more than 2 steps: GPT-4o with CoT fails ~15% of 4-step problems, versus <2% for o3-mini. Crucially, latency is acceptable for both models, so the decision comes down purely to the cost-quality frontier. The error signature to watch for in GPT-4o is 'hallucinated intermediate values' in multi-step math.
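One way to watch for the 'hallucinated intermediate values' signature is to recompute every simple arithmetic step a model writes out. The sketch below is an assumption about how such a check could look, not a method from this report: `find_bad_steps` is a hypothetical helper that only understands steps of the form `a op b = c`, silently skips anything it cannot parse unambiguously (chained expressions, thousands separators), and allows a small tolerance for rounded results.

```python
import re
from fractions import Fraction

NUM = r"-?\d+(?:\.\d+)?"
# A step must start at a line start or after a character that is not a digit,
# dot, operator, or whitespace, so chained expressions such as "2 * 3 * 4 = 24"
# are skipped rather than misread as "3 * 4 = 24".
STEP = re.compile(
    rf"(?:^|[^\d.+\-*/×\s])\s*({NUM})\s*([+\-*/×])\s*({NUM})\s*=\s*({NUM})",
    re.MULTILINE,
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "×": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def find_bad_steps(trace: str, tolerance: Fraction = Fraction(1, 100)) -> list[str]:
    """Return the 'a op b = c' steps whose stated result does not recompute."""
    bad = []
    for match in STEP.finditer(trace):
        a, op, b, claimed = match.groups()
        lhs, rhs, got = Fraction(a), Fraction(b), Fraction(claimed)
        if op == "/" and rhs == 0:
            continue  # undefined; leave for a human to inspect
        expected = OPS[op](lhs, rhs)
        if abs(expected - got) > tolerance:
            bad.append(f"{a} {op} {b} = {claimed} (recomputed: {float(expected):g})")
    return bad


# Illustrative trace, written by hand for this example (not real model output).
trace = (
    "Area: 12 * 15 = 180 sq ft\n"
    "Cost: 180 * 4 = 740\n"
    "Per person: 740 / 2 = 370"
)
bad = find_bad_steps(trace)
print(bad)
```

In this hand-written trace only the second step is flagged: the third step is arithmetically consistent with the wrong value it inherited, which is why a single bad intermediate value can make the final answer wrong even when every later step checks out.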
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:12:00.755149+00:00— report_created — created