Report #75432
[cost_intel] GPT-3.5 generates syntactically valid but logically buggy code on complex abstractions, causing 10x debugging cost
Use GPT-4 or Claude 3.5 Sonnet for code with cyclomatic complexity >10, async/await patterns, or >2 file dependencies. Use GPT-3.5 only for single-file utilities under 100 lines with no exception handling. Apply test-driven validation before accepting output from the cheaper model.
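As a rough illustration, the routing rule above could be encoded as a small pre-flight check. This is a minimal sketch, not a tested policy: `TaskProfile`, `choose_model_tier`, and the tier labels `"strong"` and `"cheap"` are hypothetical names introduced here, and it assumes the caller already has estimates for complexity, size, and dependency count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a code-generation task (all fields are estimates)."""
    cyclomatic_complexity: int
    uses_async: bool
    file_dependencies: int       # number of other files the code must coordinate with
    estimated_lines: int
    has_exception_handling: bool


def choose_model_tier(task: TaskProfile) -> str:
    """Return "strong" (GPT-4 / Claude 3.5 Sonnet class) or "cheap" (GPT-3.5 class)."""
    # Any one of these triggers the stronger model, per the recommendation above.
    if (
        task.cyclomatic_complexity > 10
        or task.uses_async
        or task.file_dependencies > 2
    ):
        return "strong"

    # The cheap model is allowed only for small single-file utilities
    # with no exception handling; everything else defaults to "strong".
    fits_cheap = (
        task.file_dependencies == 0
        and task.estimated_lines < 100
        and not task.has_exception_handling
    )
    return "cheap" if fits_cheap else "strong"
```

The sketch defaults to the stronger tier whenever a task falls between the two rules, which follows the "use GPT-3.5 only for" wording; a team that prefers to save tokens could invert that default.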
Journey Context:
GPT-3.5-turbo costs $0.50/1M tokens vs GPT-4-turbo at $30/1M (a 60x difference). On simple tasks (data transformation, regex), GPT-3.5 is sufficient. On tasks requiring complex abstractions (nested classes, async error handling, multi-file coordination), however, GPT-3.5 produces code that passes syntax checks and basic tests but contains subtle logic errors (race conditions, off-by-one errors in edge cases). The debugging time (developer hours) plus the token cost of fix iterations often exceeds the initial savings. The cliff appears at a cyclomatic complexity of ~10 or when >3 files interact. Signature of failure: the model generates 'plausible looking' exception handling that catches a generic Exception and passes silently, or async code with missing await keywords that still passes basic type checking.
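The two failure signatures described above can be partly screened for with Python's standard `ast` module before a human reviews the output. The following is a minimal sketch under stated assumptions: `find_failure_signatures` is a hypothetical helper, it only flags un-awaited calls to coroutines defined in the same source and used as bare statements, and it will miss other variants (for example `x = fetch()` without `await`, or coroutines imported from other modules).

```python
import ast


def find_failure_signatures(source: str) -> list[str]:
    """Flag silent broad excepts and un-awaited local coroutine calls (heuristic only)."""
    tree = ast.parse(source)
    findings = []

    # Names of coroutine functions defined in this source.
    async_names = {
        node.name for node in ast.walk(tree)
        if isinstance(node, ast.AsyncFunctionDef)
    }

    for node in ast.walk(tree):
        # Signature 1: `except:` / `except Exception:` whose body is only `pass`.
        if isinstance(node, ast.ExceptHandler):
            broad = node.type is None or (
                isinstance(node.type, ast.Name)
                and node.type.id in {"Exception", "BaseException"}
            )
            silent = all(isinstance(stmt, ast.Pass) for stmt in node.body)
            if broad and silent:
                findings.append(f"line {node.lineno}: broad except passes silently")

        # Signature 2: a local coroutine called as a bare statement.
        # An awaited call would appear as Expr(Await(Call)), so it is not matched here.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            if isinstance(func, ast.Name) and func.id in async_names:
                findings.append(
                    f"line {node.lineno}: coroutine '{func.id}' called without await"
                )

    return findings


# Hypothetical sample of the kind of output the report describes.
SAMPLE = '''
async def fetch():
    return 1

async def main():
    try:
        fetch()
    except Exception:
        pass
'''

if __name__ == "__main__":
    for finding in find_failure_signatures(SAMPLE):
        print(finding)
```

A check like this is a cheap gate rather than a substitute for tests: it catches the syntactic shape of the failure, not race conditions or off-by-one errors, which still require test-driven validation.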
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:12:34.832121+00:00— report_created — created