Report #29738
[cost\_intel] Math and coding tasks where reasoning models underperform despite cost premium
Avoid o1/o3 for simple arithmetic, regex parsing, or single-step lookups; reserve them for tasks that require more than three logical deductions, backtracking, or counterfactual reasoning.
Journey Context:
Counter-intuitive finding: o1 often scores lower than gpt-4o on MMLU elementary mathematics or simple calculator-style tasks because it 'overthinks' and confabulates intermediate steps. Reasoning models are optimized for exploring solution trees, not for recall. They excel at AIME competition problems \(multi-step deduction\) but can fail at 'What is 234\*456?', which 4o answers from memorized token patterns or by calling a tool. The rule: if a 10-year-old could solve it in one step, use 4o; if it requires scratch paper, use o1.
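The rule above can be sketched as a small router. This is a minimal illustration, not a tested production policy: the `Task` fields, the `route` and `eval_arithmetic` helpers, and the "calculator" label are all hypothetical names introduced here, and the deduction count is assumed to come from the caller's own estimate. Bare arithmetic expressions are sent to a local evaluator rather than to either model, on the assumption that deterministic computation is cheaper and safer than model recall.

```python
import ast
import operator
import re
from dataclasses import dataclass

# Hypothetical labels; substitute the model identifiers your provider uses.
FAST_MODEL = "gpt-4o"
REASONING_MODEL = "o1"


@dataclass
class Task:
    prompt: str
    estimated_deductions: int = 1      # caller's rough count of dependent logical steps
    needs_backtracking: bool = False
    needs_counterfactuals: bool = False


# Matches prompts made only of digits, whitespace, and basic arithmetic symbols.
_ARITHMETIC = re.compile(r"^[\d\s+\-*/().]+$")
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_arithmetic(expr: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr.strip(), mode="eval").body)


def route(task: Task) -> str:
    """Pick a destination using the thresholds stated in this report."""
    if _ARITHMETIC.match(task.prompt.strip()):
        return "calculator"            # one-step arithmetic: no model needed
    if (task.estimated_deductions > 3
            or task.needs_backtracking
            or task.needs_counterfactuals):
        return REASONING_MODEL         # the "scratch paper" case
    return FAST_MODEL                  # single-step lookups and simple parsing
```

For example, `route(Task("234*456"))` returns `"calculator"` and `eval_arithmetic("234*456")` returns `106704`, while a task flagged with `estimated_deductions=5` goes to the reasoning model. The regex only catches bare expressions; a natural-language question such as 'What is 234\*456?' would need its own extraction step, which is left out of this sketch.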
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:18:09.508672+00:00— report_created — created