Report #55299
[cost\_intel] Using GPT-4o for all code generation when 70% of tasks work on GPT-4o-mini with near-identical syntax correctness
Route code generation through a tiered router: syntax-only tasks \(lint fixes, formatting, simple refactors\) → GPT-4o-mini; architectural decisions and complex debugging → GPT-4o; verify with AST parsing before accepting.
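A minimal sketch of the routing and AST gate described above, assuming the generated code is Python. The task-kind labels, the `generate` callable, and the one-time escalation to GPT-4o on a parse failure are illustrative assumptions, not part of the report; only the tier split and the parse-before-accept step come from it.

```python
import ast
from typing import Callable

# Hypothetical task-kind labels; the report lists these examples but no schema.
SYNTAX_ONLY = {"lint_fix", "formatting", "simple_refactor"}


def pick_model(task_kind: str) -> str:
    """Pick a model tier; anything not known to be syntax-only goes to GPT-4o."""
    return "gpt-4o-mini" if task_kind in SYNTAX_ONLY else "gpt-4o"


def parses_as_python(source: str) -> bool:
    """Return True if the source parses into a Python AST."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def generate_with_gate(
    task_kind: str,
    prompt: str,
    generate: Callable[[str, str], str],  # hypothetical: (model, prompt) -> code
) -> tuple[str, str]:
    """Generate code on the routed tier and accept it only if it parses.

    Assumption: a parse failure on the small tier triggers one retry on GPT-4o.
    """
    model = pick_model(task_kind)
    code = generate(model, prompt)
    if parses_as_python(code):
        return model, code
    if model != "gpt-4o":
        model = "gpt-4o"
        code = generate(model, prompt)
        if parses_as_python(code):
            return model, code
    raise ValueError("generated code failed AST parsing")
```

The `generate` callable is injected so the gate can be exercised without a network call. Note that a successful parse only confirms syntax; it says nothing about whether the code behaves correctly at runtime.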
Journey Context:
The assumption that 'code needs the smartest model' ignores the bimodal distribution of coding tasks. Of production coding tasks, 70% are deterministic transformations: converting snake\_case to camelCase, adding type hints, generating boilerplate CRUD. On these, GPT-4o-mini achieves 98% syntax correctness versus 99% for GPT-4o, at 1/20th the cost. The cliff appears on semantic tasks such as debugging race conditions or designing distributed system boundaries. There, GPT-4o-mini hallucinates APIs or suggests unsafe concurrency. The pattern is a router based on AST complexity: if the task's output can be validated by a parser alone, use GPT-4o-mini; if it requires reasoning about runtime behavior, use GPT-4o.
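The figures above imply a rough blended cost, sketched here as simple arithmetic. It assumes every task consumes the same number of tokens on either tier and ignores any retries or escalations; the function name is illustrative.

```python
def blended_cost_ratio(mini_share: float, mini_cost_ratio: float) -> float:
    """Cost of the routed setup relative to sending every task to GPT-4o.

    mini_share: fraction of tasks routed to GPT-4o-mini (0.70 in the report).
    mini_cost_ratio: GPT-4o-mini cost relative to GPT-4o (1/20 in the report).
    Assumes equal token volume per task across tiers.
    """
    if not 0.0 <= mini_share <= 1.0:
        raise ValueError("mini_share must be between 0 and 1")
    return mini_share * mini_cost_ratio + (1.0 - mini_share)


ratio = blended_cost_ratio(mini_share=0.70, mini_cost_ratio=1 / 20)
print(f"routed cost is {ratio:.3f} of the all-GPT-4o baseline")
```

Under those assumptions the routed setup costs about a third of the baseline (0.70 × 0.05 + 0.30 = 0.335), so most of the remaining spend comes from the 30% of tasks that stay on GPT-4o.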
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:18:34.060706+00:00— report_created — created