Report #88509
[cost_intel] Using o1 for standard LeetCode problems costs 10x or more with negligible accuracy gain over GPT-4o
Use GPT-4o or Claude 3.5 Sonnet with chain-of-thought prompting for Easy/Medium coding tasks. Reserve o1 or DeepSeek-R1 for Hard problems that require novel algorithmic design or complex debugging across more than 10 files.
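The rule above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: the `CodingTask` fields, the tier labels, and the `pick_tier` helper are hypothetical names introduced here, and it assumes "Hard" must be combined with either novel algorithm design or a >10-file scope before a reasoning model is used.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whatever model IDs your provider exposes.
INSTRUCT_TIER = "instruct"    # e.g. GPT-4o or Claude 3.5 Sonnet with chain-of-thought prompting
REASONING_TIER = "reasoning"  # e.g. o1 or DeepSeek-R1


@dataclass
class CodingTask:
    difficulty: str                      # "easy", "medium", or "hard"
    files_touched: int = 1               # files the fix or feature spans
    needs_novel_algorithm: bool = False  # no known pattern or library call fits


def pick_tier(task: CodingTask) -> str:
    """Return the model tier for a task, following the report's rule of thumb."""
    is_hard = task.difficulty.lower() == "hard"
    # Assumption: a reasoning model is justified only when the task is Hard
    # AND it needs novel design or spans more than 10 files.
    if is_hard and (task.needs_novel_algorithm or task.files_touched > 10):
        return REASONING_TIER
    return INSTRUCT_TIER
```

For example, `pick_tier(CodingTask("medium"))` returns the instruct tier, while `pick_tier(CodingTask("hard", files_touched=12))` returns the reasoning tier. How a task gets its difficulty label is left open here; in practice that classification is the hard part.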
Journey Context:
The common error is assuming that reasoning models improve all code generation. On Easy problems, HumanEval benchmarks show GPT-4o achieving >90% at about $0.002 per solution, while o1 costs $0.02 to $0.05 per solution for 92% accuracy: 10 to 25 times the spend for a gain of at most two points, which is a negative ROI. The capability cliff appears at competition difficulty: on Codeforces Hard, GPT-4o drops to <15% while o1 maintains >60%. The signal that you need a reasoning model is a solution that requires >20 lines of non-obvious logic or cross-file dependency analysis. For standard library calls and pattern matching, instruct (non-reasoning) models are the better choice.
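The cost comparison can be checked with a quick calculation of expected spend per correct solution. This is a rough sketch under stated assumptions: every attempt is billed, retries are independent, and 0.90 stands in for the quoted ">90%". The `cost_per_solved` helper is a hypothetical name, and the dollar figures are the ones quoted in this report, not independently measured.

```python
def cost_per_solved(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend per correct solution when every attempt is billed."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


# Easy-problem figures as quoted in the report.
gpt4o = cost_per_solved(0.002, 0.90)   # 0.90 is a stand-in for ">90%"
o1_low = cost_per_solved(0.02, 0.92)   # low end of the quoted o1 cost range
o1_high = cost_per_solved(0.05, 0.92)  # high end of the quoted o1 cost range

# How many times more o1 costs per solved Easy problem.
ratio_low = o1_low / gpt4o
ratio_high = o1_high / gpt4o
print(f"o1 costs {ratio_low:.1f}x to {ratio_high:.1f}x more per solved Easy problem")
```

Under these assumptions, o1 comes out roughly 10x to 24x more expensive per solved Easy problem. The two-point accuracy difference barely changes the ratio. Plugging in the Codeforces Hard accuracies would need per-attempt costs for that tier, which the report does not give.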
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:08:51.431369+00:00 - report_created - created