Report #62703
[cost_intel] Small models produce plausible-but-wrong outputs on complex reasoning: when is frontier genuinely irreplaceable?
Use frontier-tier models (Opus, GPT-4-class) for: multi-step planning with dependencies, debugging subtle concurrency/logic bugs, novel architecture design, and any task where a single reasoning error invalidates the entire output. The 10-30x cost premium is justified when small-model errors require human correction that costs more than the API savings. The signature failure: small models produce correct-looking individual steps, but the errors compound into wrong conclusions.
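The cost argument above can be sketched as a break-even calculation. This is a minimal illustration, not a measured result: the function names, the per-task API prices, and the failure rates are all hypothetical, and the model assumes every failure is eventually caught and corrected by a human at a fixed cost.

```python
def expected_cost(api_cost: float, failure_rate: float, correction_cost: float) -> float:
    """Expected cost per task: API spend plus expected human correction cost."""
    return api_cost + failure_rate * correction_cost


def breakeven_correction_cost(
    small_api: float, frontier_api: float, small_fail: float, frontier_fail: float
) -> float:
    """Correction cost per failure above which the frontier model is cheaper.

    Solves small_api + small_fail * H == frontier_api + frontier_fail * H for H.
    """
    if small_fail <= frontier_fail:
        raise ValueError("no break-even: the small model must fail more often")
    return (frontier_api - small_api) / (small_fail - frontier_fail)


# Hypothetical inputs: a 20x API premium, 30% vs 5% failure rates.
small_api, frontier_api = 0.01, 0.20
small_fail, frontier_fail = 0.30, 0.05

threshold = breakeven_correction_cost(small_api, frontier_api, small_fail, frontier_fail)
print(f"Frontier is cheaper once a correction costs more than ${threshold:.2f}")
```

Under these assumed numbers the threshold is well under a dollar per correction, so even a few minutes of engineer time per failure would tip the comparison toward the frontier model. With different failure rates the threshold moves accordingly.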
Journey Context:
The temptation to downgrade models is strong because per-call costs are visible and error costs are hidden. Small models don't fail obviously: they produce outputs that pass surface-level review but contain subtle logical errors. Three specific degradation patterns: (1) Compounding errors: if each step has a 5% error rate and errors are independent, the probability of a fully correct output in a 10-step chain drops to about 60% (0.95^10 ≈ 0.60). (2) Confident hallucination in reasoning chains: small models rarely say 'I'm not sure' and instead fabricate plausible-sounding intermediate conclusions. (3) Inability to identify when they don't know: they force novel problems into familiar patterns. In agentic loops this is catastrophic: one wrong step sends the agent down a fruitless path, multiplying both cost and failure rate. A frontier model that gets it right in one shot at 10x per-token cost can be cheaper overall than a small model that needs 3 retries and still fails 30% of the time, once the human time spent catching and correcting those failures is counted.
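The compounding-error figure can be checked with a few lines of arithmetic. The sketch below assumes step errors are independent and that any single wrong step invalidates the whole chain; the helper names are hypothetical, and real per-step error rates would need to be measured for a given model and task.

```python
def chain_success(step_error: float, steps: int) -> float:
    """Probability that every step in a chain is correct, assuming independence."""
    return (1.0 - step_error) ** steps


def max_steps_for_target(step_error: float, target: float) -> int:
    """Longest chain whose full-success probability stays at or above target."""
    if not (0.0 < step_error < 1.0 and 0.0 < target <= 1.0):
        raise ValueError("step_error must be in (0, 1) and target in (0, 1]")
    steps = 0
    while chain_success(step_error, steps + 1) >= target:
        steps += 1
    return steps


# The 5% per-step error rate from the text, across a few chain lengths.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {chain_success(0.05, n):.1%} fully correct")

# How long a chain can be while staying at least 90% reliable.
print(max_steps_for_target(0.05, 0.90))
```

Under the independence assumption, a 5% per-step error rate supports only a two-step chain at 90% reliability, which is why long agentic loops are where the model tier matters most.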
Lifecycle
2026-06-20T11:44:03.124106+00:00— report_created — created