Report #85053
[cost_intel] Using small models for multi-step reasoning or complex code generation where quality falls off a cliff
Reserve frontier models (Opus, GPT-4o, Gemini Ultra) for tasks requiring 3+ dependent reasoning steps, novel algorithmic code, cross-system refactoring, or creative problem-solving. Quality degradation on these tasks is not gradual: it is a step function, past which outputs stay plausible but are logically broken.
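As a rough illustration, the criteria above could be encoded as a routing check. This is a minimal sketch, not an established API: `Task`, its fields, and `pick_tier` are hypothetical names, and it assumes the caller can estimate the number of dependent reasoning steps up front.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a request; the caller estimates every field."""
    dependent_steps: int = 1             # steps that each rely on the previous result
    novel_algorithm: bool = False        # no well-known template to copy from
    cross_system_refactor: bool = False  # touches several modules or services
    creative: bool = False               # open-ended problem-solving


def pick_tier(task: Task) -> str:
    """Return "frontier" if any criterion from the report applies, else "small"."""
    needs_frontier = (
        task.dependent_steps >= 3
        or task.novel_algorithm
        or task.cross_system_refactor
        or task.creative
    )
    return "frontier" if needs_frontier else "small"
```

The check is deliberately binary rather than a weighted score, mirroring the step-function behaviour the report describes.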
Journey Context:
Small models handle single-step inference well but degrade sharply on chains where each step depends on the previous one. The failure signature is insidious: confident, well-formatted output with a logical error in step 2 or 3 that cascades through the rest. This is worse than an obvious error because it passes surface-level code review. For code generation, the cliff appears at tasks that require understanding side effects, implicit invariants, or interactions between multiple modules. For boilerplate and single-function generation, Haiku is fine. For multi-file refactoring with cross-cutting concerns, a frontier model is required. The cost tradeoff is real (at Claude 3 list prices, Opus is ~5x Sonnet and ~60x Haiku per token), but shipping subtly broken logic costs more than the API spend. Mitigation: run small models behind automated validation (tests, type-checking, linting) and escalate failures to frontier models.
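The mitigation described above can be sketched as a validate-then-escalate loop. This is a minimal illustration under stated assumptions: `generate_with_escalation`, `Result`, and the stub models are hypothetical, a "model" is any callable from prompt to text (no real vendor API is called), and the validators stand in for a real test suite, type checker, or linter.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Model = Callable[[str], str]                # hypothetical: prompt -> generated code
Validator = Callable[[str], Optional[str]]  # returns an error message, or None on pass


@dataclass
class Result:
    output: str
    tier: str                                          # tier whose output was accepted
    failures: List[str] = field(default_factory=list)  # errors from cheaper tiers


def generate_with_escalation(
    prompt: str,
    tiers: Sequence[Tuple[str, Model]],
    validators: Sequence[Validator],
) -> Result:
    """Try tiers cheapest-first; accept the first output that passes every validator."""
    failures: List[str] = []
    for name, model in tiers:
        output = model(prompt)
        error = None
        for check in validators:  # stop at the first failing check
            error = check(output)
            if error is not None:
                break
        if error is None:
            return Result(output, name, failures)
        failures.append(f"{name}: {error}")
    raise RuntimeError("all tiers failed validation: " + "; ".join(failures))


def syntax_check(code: str) -> Optional[str]:
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def unit_test(code: str) -> Optional[str]:
    # Assumption: generated code defines add(a, b). Real generated code should
    # be executed in a sandbox, not with a bare exec().
    namespace: dict = {}
    exec(code, namespace)
    got = namespace["add"](2, 3)
    return None if got == 5 else f"add(2, 3) returned {got}, expected 5"


# Stub models: the small one returns plausible but logically wrong code.
small_model = lambda prompt: "def add(a, b):\n    return a - b\n"
frontier_model = lambda prompt: "def add(a, b):\n    return a + b\n"

result = generate_with_escalation(
    "Write add(a, b).",
    tiers=[("small", small_model), ("frontier", frontier_model)],
    validators=[syntax_check, unit_test],
)
print(result.tier, result.failures)
```

In this stub setup the small model's output passes the syntax check and fails only the behavioural test, which is the plausible-but-broken case; the loop is only as strong as the validators behind it.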
Lifecycle
2026-06-22T01:20:53.771546+00:00— report_created — created