Report #66050
[cost\_intel] Use different model tiers for code review and code generation
Route code review, explanation, and simple bug detection to Flash/Haiku-class models \(10-20x cheaper, with quality within about 5% of the larger tier\). Route novel code generation, complex refactoring, and multi-file architecture changes to Sonnet/Pro. The quality cliff for generation is steep: smaller models produce code that compiles but contains subtle logic errors.
Journey Context:
Code understanding tasks \(explain this function, find the bug, suggest test cases\) are largely pattern-matching, and smaller models handle them surprisingly well. Code generation is different: smaller models produce 'locally plausible but globally incorrect' code, meaning functions that look right in isolation but misuse APIs, have off-by-one errors, or miss edge cases. On HumanEval, GPT-4o scores ~90% vs ~87% for GPT-4o-mini, a small gap. But on real-world multi-file tasks that require consistency across modules, the gap widens to 20-30% because smaller models struggle to maintain invariants across files. The cost of debugging subtle logic errors often exceeds the inference savings. A practical routing heuristic: if the task requires writing >50 lines or modifying >2 files, use a frontier model.
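The routing heuristic above can be sketched as a small function. This is a minimal illustration, not a tested production router: the `Task` fields, the task-kind labels, and the `pick_tier` name are all hypothetical, and the line and file estimates are assumed to come from the caller. Two choices here are assumptions that go beyond the report: small generation tasks under both thresholds go to the cheaper tier, and unrecognized task kinds fall back to the frontier tier as the conservative default.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "small"        # Flash/Haiku-class model
    FRONTIER = "frontier"  # Sonnet/Pro-class model


# Hypothetical labels for the code-understanding tasks named in the report.
UNDERSTANDING_KINDS = {"review", "explain", "find_bug", "suggest_tests"}
# Hypothetical label for small, self-contained generation tasks.
SMALL_GENERATION_KINDS = {"generate"}


@dataclass
class Task:
    kind: str                   # hypothetical task label supplied by the caller
    est_lines_written: int = 0  # caller's estimate of lines to be written
    files_modified: int = 0     # caller's estimate of files to be touched


def pick_tier(task: Task, max_lines: int = 50, max_files: int = 2) -> Tier:
    """Pick a model tier using the report's >50 lines / >2 files thresholds."""
    # Size check first: large or multi-file work goes to the frontier tier
    # regardless of how the task is labeled.
    if task.est_lines_written > max_lines or task.files_modified > max_files:
        return Tier.FRONTIER
    # Code understanding is routed to the cheaper tier.
    if task.kind in UNDERSTANDING_KINDS:
        return Tier.SMALL
    # Assumption: small generation tasks under both thresholds are cheap enough
    # to try on the smaller tier.
    if task.kind in SMALL_GENERATION_KINDS:
        return Tier.SMALL
    # Assumption: anything unrecognized (e.g. refactoring, architecture work)
    # defaults to the frontier tier.
    return Tier.FRONTIER
```

For example, `pick_tier(Task("review"))` selects the small tier, while `pick_tier(Task("generate", est_lines_written=120))` selects the frontier tier. The thresholds are parameters so they can be tuned against observed debugging cost rather than treated as fixed.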
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:20:34.382295+00:00 - report_created - created