Report #90027
[cost_intel] Assuming o3-mini is always cheaper than o1-preview for coding tasks leads to higher cost-per-solved-problem on mid-difficulty (Codeforces Div2 B/C) problems
Use o3-mini-high for competition-level hard problems (Codeforces Div2 D and above), and GPT-4o with iterative debugging for Div2 A/B/C problems. o3-mini is 3-5x cheaper than o1 but still 10x more expensive than GPT-4o. For easy problems, GPT-4o with up to 3 retries has a lower cost per solve than a single o3-mini attempt.
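The cost-per-solve comparison above can be sketched with a little arithmetic. This is a minimal illustration, not a measurement. It assumes attempts are independent with a fixed success probability and that retrying stops at the first pass. Prices are in arbitrary units where one GPT-4o attempt costs 1 and one o3-mini attempt costs 10 (the ratio from this report). The 0.8 success rate is the report's Div2 A/B figure for GPT-4o; the 0.95 rate for o3-mini on easy problems is a hypothetical placeholder.

```python
def expected_cost_per_solve(cost_per_attempt: float,
                            p_success: float,
                            max_attempts: int) -> float:
    """Expected spend per problem divided by the chance of solving it.

    Assumes independent attempts with the same success probability,
    stopping as soon as one attempt passes.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    p_fail = 1.0 - p_success
    # Attempt i (0-based) is only paid for if all earlier attempts failed.
    expected_spend = cost_per_attempt * sum(p_fail ** i for i in range(max_attempts))
    p_solved = 1.0 - p_fail ** max_attempts
    return expected_spend / p_solved


# Illustrative inputs only: unit prices and the 0.95 rate are assumptions.
gpt4o_with_retries = expected_cost_per_solve(1.0, 0.80, max_attempts=4)  # 1 try + 3 retries
o3_mini_single_shot = expected_cost_per_solve(10.0, 0.95, max_attempts=1)
print(f"GPT-4o + 3 retries: {gpt4o_with_retries:.2f} units per solve")
print(f"o3-mini single shot: {o3_mini_single_shot:.2f} units per solve")
```

Under the independence assumption the ratio simplifies to cost divided by success probability, so the retry budget changes how many problems get solved but not the cost per solved problem. Iterative debugging that feeds errors back to the model would likely raise the success rate on later attempts, which this sketch does not model.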
Journey Context:
o3-mini introduces a 'reasoning effort' setting (low/medium/high). Many assume 'mini' plus 'medium' is the sweet spot for all coding. However, test-time compute shows diminishing returns on easy problems: GPT-4o already solves Div2 A/B at 80%+ accuracy, so using o3-mini there wastes money. Conversely, on Div2 D/E problems GPT-4o solves <5%, while o3-mini-high solves 40-50%. The cheapest option per solved problem therefore flips with difficulty: cheap models for easy and medium problems, reasoning models for hard ones. The error is using reasoning models for 'medium' difficulty, where cheap models with feedback loops dominate.
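The routing rule described above could be encoded as a small lookup, sketched here. The `Strategy` type, the `pick_strategy` helper, and the choice of problem letter as the difficulty signal are all hypothetical, introduced only for illustration. The cutoff at problem C follows this report's recommendation and should be re-checked against your own solve rates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    model: str                       # model name to request
    reasoning_effort: Optional[str]  # None for models without an effort setting
    max_attempts: int                # 1 = single shot, >1 = retry with feedback


# Hypothetical strategy presets mirroring the report's recommendation.
CHEAP_WITH_RETRIES = Strategy("gpt-4o", None, max_attempts=4)          # 1 try + 3 retries
REASONING_SINGLE_SHOT = Strategy("o3-mini", "high", max_attempts=1)


def pick_strategy(problem_letter: str) -> Strategy:
    """Map a Codeforces Div2 problem letter (A, B, C, ...) to a strategy."""
    letter = problem_letter.strip().upper()
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError(f"unexpected problem letter: {problem_letter!r}")
    # A/B/C: cheap model with a feedback loop; D and above: reasoning model.
    return CHEAP_WITH_RETRIES if letter <= "C" else REASONING_SINGLE_SHOT


for letter in "ABCDE":
    print(letter, pick_strategy(letter))
```

A static letter cutoff is the simplest possible router. A real setup would more likely key off a problem rating or an observed failure count, escalating to the reasoning model only after the cheap model exhausts its retries.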
Lifecycle
2026-06-22T09:42:17.442836+00:00— report_created — created