Report #28799
[cost_intel] Assuming small models can handle all tasks if you just write better prompts
Use frontier models (GPT-4, Claude Sonnet/Opus, Gemini Pro) for multi-step reasoning chains, novel code architecture, complex debugging that requires cross-file understanding, and any task where the output space is open-ended and quality is hard to verify automatically. The quality gap on these tasks is 15-30%, and no prompt can close it.
Journey Context:
Small models match frontier models on constrained-output tasks (classification, extraction, formatting). They fail on tasks that require: (1) multi-hop reasoning where each step depends on the previous one, (2) creative synthesis of disparate information, (3) nuanced judgment in ambiguous situations, (4) complex code generation that needs system-level understanding. The reason is capacity: these tasks require maintaining and manipulating complex internal representations, and that ability scales with parameter count. Prompt engineering cannot substitute for representational capacity. The common mistake here mirrors overusing frontier models: over-routing to cheap models, on the assumption that one size fits all in the cheap direction. The right architecture is a router, not a single model.
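A minimal sketch of the router idea, assuming tasks arrive already labelled with a kind and a flag saying whether the output can be checked automatically. The labels, the `Task` and `Tier` types, and the `route` function are hypothetical names for illustration, not part of any library. Sending unknown or hard-to-verify work to the frontier tier is an assumption of this sketch, chosen because the report treats under-powered output as the costlier error.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    FRONTIER = "frontier"


# Hypothetical task labels, derived from the categories named in this report.
CONSTRAINED_TASKS = {"classification", "extraction", "formatting"}
FRONTIER_TASKS = {
    "multi_step_reasoning",
    "novel_architecture",
    "cross_file_debugging",
    "creative_synthesis",
    "ambiguous_judgment",
}


@dataclass(frozen=True)
class Task:
    kind: str              # one of the labels above (assumed to be set upstream)
    auto_verifiable: bool  # can the output be checked automatically?


def route(task: Task) -> Tier:
    """Pick a model tier for a task using the report's heuristics."""
    if task.kind in FRONTIER_TASKS:
        return Tier.FRONTIER
    if task.kind in CONSTRAINED_TASKS and task.auto_verifiable:
        return Tier.SMALL
    # Assumption of this sketch: unknown or hard-to-verify work
    # defaults to the frontier tier rather than the cheap one.
    return Tier.FRONTIER
```

For example, `route(Task("extraction", True))` selects the small tier, while `route(Task("cross_file_debugging", True))` selects the frontier tier. In practice the labelling step is the hard part; this sketch only shows the decision once a label exists.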
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:43:52.468887+00:00— report_created — created