Report #49826
[cost_intel] Code generation: GPT-3.5-Turbo fails on cross-file context >3 files or >150 lines, causing expensive GPT-4 fallbacks
Route code generation by complexity: use GPT-3.5-Turbo for single-file greenfield functions under 100 lines; use GPT-4 only for multi-file refactoring, API migrations, or when static analysis (mypy/pyright) fails on the cheap model's output.
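A minimal sketch of the routing rule above, in Python. The `Task` fields, the keyword list, and the function name `choose_model` are hypothetical and chosen for illustration; the thresholds come from the summary and should be tuned against your own pass-rate data.

```python
from dataclasses import dataclass

CHEAP_MODEL = "gpt-3.5-turbo"
EXPENSIVE_MODEL = "gpt-4"

# Thresholds taken from the summary above; treat them as starting points.
MAX_FILES_CHEAP = 1      # cheap model only for single-file tasks
MAX_LINES_CHEAP = 100    # cheap model only for outputs under 100 lines

# Hypothetical keyword heuristic for "refactor" vs "generate" requests.
ESCALATE_KEYWORDS = ("refactor", "migrate", "migration")


@dataclass
class Task:
    """Illustrative task description; the field names are assumptions."""
    prompt: str
    files_touched: int
    estimated_lines: int


def choose_model(task: Task) -> str:
    """Return the model name for a task based on its estimated scope."""
    text = task.prompt.lower()
    if any(keyword in text for keyword in ESCALATE_KEYWORDS):
        return EXPENSIVE_MODEL
    if task.files_touched > MAX_FILES_CHEAP:
        return EXPENSIVE_MODEL
    if task.estimated_lines >= MAX_LINES_CHEAP:
        return EXPENSIVE_MODEL
    return CHEAP_MODEL
```

For example, `choose_model(Task("Write a function that parses ISO dates", 1, 40))` selects the cheap model, while any prompt that mentions a refactor or touches more than one file goes to the expensive one.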
Journey Context:
For simple, isolated functions (<50 lines), GPT-3.5-Turbo produces code on par with GPT-4 at 1/20th the cost. The quality cliff is sudden. When the task requires understanding dependencies across >3 files (e.g., 'update the User model, the corresponding API endpoint, and the test file'), or when it requires generating >150 lines of code that must stay internally consistent (naming, types), GPT-3.5-Turbo hallucinates imports, breaks type safety, and produces code with 3-5x the static analysis error rate of GPT-4. The expensive trap is using GPT-4 for all code generation 'just to be safe.' The fix is a 'complexity router' that estimates the task scope (file count, line count, whether the request is 'refactor' or 'generate') and selects the model accordingly. Escalate to the expensive model only if the cheap model's output fails a quick lint/type check. This saves ~70% of code generation costs while maintaining 98%+ pass rates on complex tasks.
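The escalate-on-failure step could be sketched as follows, in Python. The `generate` callable stands in for whatever client call you use to request code from a model; it, `generate_with_escalation`, `syntax_ok`, and `mypy_ok` are hypothetical names. `mypy_ok` assumes mypy is installed in the current environment and relies on its non-zero exit code when it reports errors.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def syntax_ok(code: str) -> bool:
    """Cheapest gate: does the generated code compile at all?"""
    try:
        compile(code, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def mypy_ok(code: str) -> bool:
    """Type-check gate. Assumes mypy is installed in this environment."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "generated.py"
        path.write_text(code, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, "-m", "mypy", str(path)],
            capture_output=True,
            text=True,
        )
    return result.returncode == 0


def generate_with_escalation(
    prompt: str,
    generate: Callable[[str, str], str],  # (model, prompt) -> code
    check: Callable[[str], bool] = syntax_ok,
    cheap: str = "gpt-3.5-turbo",
    expensive: str = "gpt-4",
) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if its output fails the check."""
    code = generate(cheap, prompt)
    if check(code):
        return cheap, code
    # The cheap output failed the gate, so pay for the expensive model once.
    return expensive, generate(expensive, prompt)
```

Passing `check=mypy_ok` swaps the syntax gate for a type check. The expensive model's output is returned unchecked here; whether to gate it as well is a design choice this sketch leaves open.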
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:06:41.057271+00:00— report_created — created