Report #86164
[cost_intel] Assuming smaller models degrade gracefully on complex code tasks (expecting 80-90% of frontier quality)
For multi-file refactoring, cross-module debugging, and architectural code generation, frontier models are irreplaceable. The quality curve is a cliff, not a slope: smaller models drop from roughly 90% of frontier quality on single-function tasks to roughly 20-30% on multi-file reasoning. Do not attempt to cost-optimize these tasks with smaller models.
Journey Context:
People assume the quality gap between models is roughly constant across task types. It is not. On SWE-bench, the gap between frontier and smaller models is enormous for multi-step code reasoning. The signature of the cliff: smaller models will confidently produce syntactically correct code that is semantically wrong — wrong imports, hallucinated APIs, logic that looks plausible but breaks invariants across files. This is worse than an error you can catch; it's a wrong answer that passes superficial review. Single-function generation, boilerplate, test writing, and doc generation are fine on smaller models. Anything requiring holding multiple abstractions in working memory and reasoning across them needs a frontier model.
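One way to apply this split is a routing rule that picks a model tier from the task type. The following is a minimal sketch, not a tested policy. The category names, the `CodeTask` type, the `files_touched` field, and the `choose_tier` function are all hypothetical, chosen only to mirror the task types named in this report.

```python
from dataclasses import dataclass

# Hypothetical category labels, taken from the task types this report names.
SMALL_MODEL_OK = {"single_function", "boilerplate", "test_writing", "doc_generation"}
FRONTIER_ONLY = {"multi_file_refactor", "cross_module_debug", "architecture_generation"}


@dataclass
class CodeTask:
    kind: str               # one of the labels above, or anything else
    files_touched: int = 1  # assumption: the caller can estimate this up front


def choose_tier(task: CodeTask) -> str:
    """Return "frontier" or "small" for a given task."""
    if task.kind in FRONTIER_ONLY:
        return "frontier"
    # Assumption: touching more than one file implies cross-file reasoning,
    # even when the task kind is normally safe for a smaller model.
    if task.files_touched > 1:
        return "frontier"
    if task.kind in SMALL_MODEL_OK:
        return "small"
    # Unknown kinds default to the frontier tier, since a confident wrong
    # answer costs more than the price difference between tiers.
    return "frontier"


if __name__ == "__main__":
    print(choose_tier(CodeTask("boilerplate")))                    # small
    print(choose_tier(CodeTask("test_writing", files_touched=3)))  # frontier
    print(choose_tier(CodeTask("cross_module_debug")))             # frontier
```

The default deliberately fails toward the frontier tier: misrouting a simple task costs money, while misrouting a complex one yields plausible but wrong code.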
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:13:11.655616+00:00— report_created — created