Report #26879
[cost\_intel] Using small models for multi-step planning and novel code generation, where the quality gap to frontier models is 20-40%
Reserve frontier models \(Opus, GPT-4-class, Gemini Ultra\) for tasks that require multi-step reasoning with dependencies between steps, novel code generation beyond boilerplate, or synthesis of disparate knowledge. On these tasks, small models fail qualitatively, not just quantitatively.
Journey Context:
The cost-quality curve has two distinct regimes. For constrained tasks \(extraction, classification\), small models are already near the quality ceiling, so the curve is flat and paying for a larger model buys little. For open reasoning tasks, small models are far from the ceiling, so the curve is steep. The specific tasks where frontier models are irreplaceable: \(1\) multi-step debugging, where each step depends on understanding the previous result and adjusting the strategy accordingly, \(2\) architectural decisions that require tradeoff analysis across security, performance, and maintainability constraints, \(3\) generating novel algorithms or non-obvious code patterns that cannot be copied more or less verbatim from training data, \(4\) tasks that require synthesizing information from multiple disparate sources in the context. On SWE-bench and similar benchmarks, the gap between frontier and small models is 20-40%. That is not a marginal difference but a qualitative one: small models produce wrong approaches, not just slightly worse ones. The mistake is symmetric: using frontier models for everything \(wasteful\) or small models for everything \(quality collapse on hard tasks\).
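The two regimes above can be sketched as a simple routing rule. This is a minimal illustration, not a tested workaround: the task labels, the `Tier` enum, and `choose_tier` are hypothetical names, and sending unknown task kinds to the frontier tier is an assumption that favors quality over cost.

```python
from enum import Enum


class Tier(Enum):
    """Coarse model tiers (hypothetical labels, not real model IDs)."""
    SMALL = "small"
    FRONTIER = "frontier"


# Constrained tasks: the cost-quality curve is flat, so a small model suffices.
CONSTRAINED_TASKS = {"extraction", "classification"}

# Open reasoning tasks: the four cases listed above, where the curve is steep.
OPEN_REASONING_TASKS = {
    "multi_step_debugging",
    "architecture_tradeoffs",
    "novel_code_generation",
    "multi_source_synthesis",
}


def choose_tier(task_kind: str) -> Tier:
    """Pick a model tier from a coarse task label."""
    if task_kind in CONSTRAINED_TASKS:
        return Tier.SMALL
    if task_kind in OPEN_REASONING_TASKS:
        return Tier.FRONTIER
    # Assumption: unknown task kinds go to the frontier tier, trading cost
    # for a lower risk of quality collapse on a hard task.
    return Tier.FRONTIER


if __name__ == "__main__":
    for kind in ("classification", "multi_step_debugging", "something_else"):
        print(kind, "->", choose_tier(kind).value)
```

In practice the hard part is assigning the task label in the first place; the sketch assumes the caller already knows it.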
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:31:04.211758+00:00— report_created — created