Report #62712
[cost_intel] Using frontier models for all code generation: where exactly can cheaper models substitute without quality loss?
Small models match frontier models within 5% on boilerplate generation, format conversion, CRUD endpoints, test stubs, docstrings, and simple refactoring. Frontier models are still required for algorithm implementation, concurrency/debugging, architecture design, performance optimization, and security-sensitive code. The quality gap stays under 5% for pattern-matching tasks but widens to 30-50% for tasks that require deep reasoning about correctness.
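The split above can be written down as a routing table. The following is a minimal sketch, not a tested production router: the category strings, the `Tier` enum, and the `route` function are hypothetical names chosen for illustration, and the sketch assumes each task has already been labeled with a category. Unknown categories fall back to the frontier tier, on the assumption that a wrong downgrade costs more than a wrong upgrade.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    FRONTIER = "frontier"


# Categories the report lists as within 5% of frontier quality.
# The string labels themselves are illustrative assumptions.
SMALL_MODEL_TASKS = frozenset({
    "boilerplate",
    "format_conversion",
    "crud_endpoint",
    "test_stub",
    "docstring",
    "simple_refactor",
})

# Categories the report lists as requiring a frontier model.
FRONTIER_TASKS = frozenset({
    "algorithm",
    "concurrency_debugging",
    "architecture",
    "performance",
    "security_sensitive",
})


def route(task_category: str) -> Tier:
    """Pick a model tier for a task that has already been categorized."""
    if task_category in SMALL_MODEL_TASKS:
        return Tier.SMALL
    # Listed frontier tasks and anything unrecognized both go to the
    # frontier tier: defaulting upward is the conservative choice.
    return Tier.FRONTIER
```

For example, `route("docstring")` returns `Tier.SMALL`, while `route("security_sensitive")` and any unlisted category return `Tier.FRONTIER`.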
Journey Context:
Code generation has a steep quality cliff that maps directly to cognitive complexity. Small models excel at pattern-matching tasks because these are massively represented in training data: writing a Rails controller, converting JSON to TypeScript types, generating a Django migration. They fail on tasks that require reasoning about runtime behavior, such as identifying a race condition in concurrent Go code, optimizing a PostgreSQL query plan, or designing an API that correctly handles edge cases in distributed systems. Small models degrade on complex code in a recognizable way: the output is syntactically correct but contains logical errors, misses error-handling paths, and 'looks right but isn't', passing basic tests while failing on edge cases. Catching these errors in code review or in production incidents costs 10-100x the API savings. A practical strategy: use small models for the 70% of code that is structural or boilerplate, route the 30% that requires reasoning to frontier models, and use the frontier model's output as the quality bar when deciding whether a task category is safe to downgrade.
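One way to apply the "frontier output as a quality bar" idea is to run both tiers against the same test suite for a task category, edge cases included, and compare pass rates. The sketch below is an illustration under stated assumptions, not a method taken from the report: `EvalResult` and `safe_to_downgrade` are hypothetical names, and the 5% threshold is interpreted here as an absolute gap in pass rate (the report does not say whether its percentages are absolute or relative).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Test outcomes for one model on one task category."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        if self.total <= 0:
            raise ValueError("total must be positive")
        return self.passed / self.total


def safe_to_downgrade(small: EvalResult,
                      frontier: EvalResult,
                      max_gap: float = 0.05) -> bool:
    """Return True if the small model is within max_gap of the frontier bar.

    max_gap is an absolute difference in pass rate (0.05 = 5 percentage
    points). This reading of "within 5%" is an assumption.
    """
    return frontier.pass_rate - small.pass_rate <= max_gap
```

With these definitions, a small model passing 96 of 100 tests against a frontier model passing 98 of 100 would be judged safe to downgrade, while 60 of 100 against 98 of 100 would not. Because the failure mode described above is passing basic tests but failing edge cases, the comparison is only meaningful if the suite actually contains edge-case tests.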
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:44:39.932346+00:00— report_created — created