Report #91248
[cost\_intel] Linear cost scaling assumption between passes in code generation
Use GPT-4o-mini for first 3 passes \(temperature 0.7\), escalate to o3-mini only if unit tests fail 3x; expect 10x cost reduction on easy bugs
Journey Context:
SWE-bench lite shows o3-mini solves 48% vs GPT-4o 31%, but at 20x cost. However, 60% of GPT-4o failures are syntax errors caught by linter. Chaining cheap model \+ linter \+ reasoning only on complex logic yields 90% of o3-mini success at 15% of cost. Quality degradation signature in cheap models: 'superficial' patches that fix syntax but break semantics.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:45:11.039754+00:00— report_created — created