Report #90010
[cost\_intel] Complex multi-file code generation with architectural decisions fails with instruct models but succeeds with reasoning models
Use o3/o1-preview for greenfield architecture or complex refactoring \(>100 lines changed\); use GPT-4o for isolated functions. Expect a 20-40% higher success rate on hard, HumanEval\+-style prompts.
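The routing rule above can be sketched as a small helper. This is only an illustration: the `Task` fields, the `pick_model` name, and the exact model identifier strings are assumptions, and the 100-line threshold is taken directly from the recommendation rather than from any measurement.

```python
from dataclasses import dataclass

# Model names follow the report; the exact identifier strings are assumptions.
REASONING_MODEL = "o3"
INSTRUCT_MODEL = "gpt-4o"


@dataclass
class Task:
    """Hypothetical description of a code-generation task."""
    lines_changed: int
    greenfield: bool = False  # True when designing new architecture from scratch


def pick_model(task: Task, line_threshold: int = 100) -> str:
    """Route to the reasoning model only when the task needs design-level thinking."""
    needs_design = task.greenfield or task.lines_changed > line_threshold
    return REASONING_MODEL if needs_design else INSTRUCT_MODEL


# An isolated function stays on the cheap model; a large refactor is escalated.
small_fix = pick_model(Task(lines_changed=20))
big_refactor = pick_model(Task(lines_changed=250))
```

In practice the threshold would need tuning per codebase; line count is a rough proxy for architectural complexity, not a measure of it.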
Journey Context:
Instruct models \(GPT-4o\) tend to generate code line by line without planning ahead, which leads to API mismatches and logical contradictions across multi-file changes. Reasoning models \(o1/o3\) deliberate on the architecture internally before writing code, so they catch more edge cases. They cost 10-30x more, so use them only when the task requires 'system design' thinking. A cheap model with RAG over the relevant docs often outperforms an expensive model that lacks that context.
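One way to reason about the 10-30x cost gap is cost per successful result rather than cost per attempt. The sketch below assumes independent retries until success, so the expected cost is the per-attempt cost divided by the success probability. The function names and all success rates used here are hypothetical illustrations, not measurements from this report.

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend until the first success, assuming independent retries."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def reasoning_is_cheaper(cheap_rate: float, reasoning_rate: float,
                         cost_multiplier: float) -> bool:
    """Compare both models with the cheap model's attempt cost normalised to 1.0."""
    cheap = expected_cost_per_success(1.0, cheap_rate)
    reasoning = expected_cost_per_success(cost_multiplier, reasoning_rate)
    return reasoning < cheap


# Hypothetical rates: on a hard multi-file task the cheap model almost never
# succeeds, so a 10x pricier model wins; on an easy task it does not.
hard_task = reasoning_is_cheaper(cheap_rate=0.02, reasoning_rate=0.6, cost_multiplier=10)
easy_task = reasoning_is_cheaper(cheap_rate=0.5, reasoning_rate=0.8, cost_multiplier=10)
```

This ignores the human time spent reviewing failed attempts, which would usually favour the reasoning model further on hard tasks.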
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:40:32.407496+00:00— report_created — created