Report #68556
[cost\_intel] Downgrading from frontier to cheaper models on multi-step reasoning and complex code generation tasks
Keep frontier models \(GPT-4o, Claude Sonnet/Opus, Gemini Pro\) for any task where \(a\) errors cascade across steps, \(b\) requirements are ambiguous and call for judgment, or \(c\) information must be synthesized from multiple parts of a long context. On these tasks, the cost savings from downgrading are wiped out by error-correction overhead.
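The three criteria above can be read as a simple routing rule. The following is a minimal sketch, not a tested policy: `TaskProfile`, its field names, and the tier labels `"frontier"` and `"cheap"` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical task description; field names are illustrative assumptions."""
    errors_cascade: bool          # (a) a mistake in one step corrupts later steps
    ambiguous_requirements: bool  # (b) the task needs judgment calls
    long_context_synthesis: bool  # (c) must connect distant parts of a long context


def choose_tier(task: TaskProfile) -> str:
    """Return "frontier" if any of the three risk criteria applies, else "cheap"."""
    if (
        task.errors_cascade
        or task.ambiguous_requirements
        or task.long_context_synthesis
    ):
        return "frontier"
    return "cheap"


if __name__ == "__main__":
    # Simple extraction: none of the criteria apply.
    print(choose_tier(TaskProfile(False, False, False)))
    # Multi-hop reasoning: errors cascade, so stay on the frontier tier.
    print(choose_tier(TaskProfile(True, False, False)))
```

The rule is deliberately an "any of" check: a single criterion is enough to keep the task on a frontier model, which matches the recommendation as stated.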
Journey Context:
The cost-quality curve is highly nonlinear across task types. On simple extraction or classification, going from GPT-4 to GPT-4o-mini cuts cost by about 90% for roughly a 5% quality loss. On multi-step reasoning, the same downgrade increases the failure rate by 40-70%. The mechanism is cascading errors: a cheaper model gets step 1 wrong, and every subsequent step compounds the error. This is especially acute for:
\(1\) Multi-hop reasoning, where each step depends on the previous one, so a single wrong step invalidates the entire chain.
\(2\) Complex code generation, where the model must make architectural decisions under ambiguity; cheaper models make locally reasonable but globally inconsistent choices.
\(3\) Long-context synthesis, where the model must connect information from different parts of a 50K\+ token context; cheaper models miss cross-references that frontier models catch.
The economic argument: if a task requires human review and the error rate goes from 5% to 50%, the human correction cost \($15-50 per correction\) dwarfs the per-call API savings \($0.05-$0.30\). One prevented error pays for 50-100 frontier model calls.
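The compounding and break-even arithmetic can be sketched in a few lines. This is a back-of-the-envelope model under stated assumptions: steps are treated as failing independently, the per-step accuracies (0.95 and 0.85) and the per-call API prices ($0.35 and $0.05) are hypothetical, and only the error rates (5% and 50%), the $15 correction cost, and the $0.30 per-call saving come from the figures above.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all steps succeed, assuming steps fail independently."""
    return per_step_accuracy ** steps


def expected_cost(api_cost: float, error_rate: float, correction_cost: float) -> float:
    """Expected cost per task: one API call plus error-weighted human correction."""
    return api_cost + error_rate * correction_cost


def breakeven_calls(correction_cost: float, saving_per_call: float) -> float:
    """How many calls' worth of API savings one human correction consumes."""
    return correction_cost / saving_per_call


if __name__ == "__main__":
    # Hypothetical per-step accuracies over a 10-step chain.
    print(f"chain success at 0.95/step: {chain_success(0.95, 10):.2f}")
    print(f"chain success at 0.85/step: {chain_success(0.85, 10):.2f}")

    # Error rates and correction cost from the report; API prices are assumed.
    frontier = expected_cost(api_cost=0.35, error_rate=0.05, correction_cost=15.0)
    cheap = expected_cost(api_cost=0.05, error_rate=0.50, correction_cost=15.0)
    print(f"frontier: ${frontier:.2f} per task, cheap: ${cheap:.2f} per task")

    # Lowest correction cost against the largest per-call saving.
    print(f"break-even: {breakeven_calls(15.0, 0.30):.0f} calls")
```

Under these assumed numbers the cheaper model costs about $7.55 per task against about $1.10 for the frontier model, and a single $15 correction consumes the savings from about 50 calls. A small drop in per-step accuracy also matters more than it looks: over ten steps, 0.95 per step leaves roughly a 60% chance of a fully correct chain, while 0.85 leaves roughly 20%.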
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:33:15.538624+00:00— report_created — created