Report #37722
[cost_intel] Small model quality cliff on multi-step reasoning: where Haiku and GPT-4o-mini fall off
For tasks requiring 3+ reasoning steps, logical inference across document sections, or connecting multiple pieces of information, use frontier models (Sonnet, GPT-4o). Small models show a sharp 20-40% quality degradation on these tasks that no amount of prompting can recover.
Journey Context:
Small models (Haiku, GPT-4o-mini, Flash) match frontier models on pattern-matching tasks (classification, extraction, formatting, summarization of explicit content), typically within 2-5% quality. On reasoning tasks, the degradation is not gradual; it is a step-function cliff.

The specific failure signatures to watch for:
(1) getting the first step right but losing the logical thread by step 3
(2) confidently stating conclusions that contradict premises given earlier in the same response
(3) extracting information correctly but failing to connect it for inference
(4) hallucinating specifics when the answer requires synthesis rather than retrieval

Chain-of-thought prompting does not fix this, because the model lacks the reasoning capability to execute the chain reliably.

The cost trap: attempting to compensate with multi-prompt decomposition workflows (break reasoning into steps, validate each step, retry failures) often costs MORE than a single frontier model call. Each decomposition step re-sends the context, so input-token usage scales roughly with the number of steps, and validation and retry calls add to that.

Empirical test: if your task requires the model to draw a non-obvious conclusion from 2+ pieces of information that are not adjacent in the input, benchmark small vs. frontier on 200 examples before committing to the small model.
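The cost trap described above can be made concrete with a rough back-of-the-envelope model. The sketch below is illustrative only: the `Pricing` class, the `call_cost` and `decomposed_cost` helpers, the per-million-token prices, and the token counts are all assumptions made up for this example, not real provider rates or measurements. It assumes every decomposition step re-sends the full context, every step is followed by one validation call, and a fixed fraction of steps is retried once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD. Values used below are placeholders."""
    input_per_mtok: float
    output_per_mtok: float


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one model call."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


def decomposed_cost(
    pricing: Pricing,
    context_tokens: int,
    steps: int,
    output_tokens_per_step: int,
    validation_output_tokens: int = 50,
    retry_rate: float = 0.0,
) -> float:
    """Expected cost in USD of a decomposed multi-prompt workflow.

    Assumptions (hypothetical, adjust to your pipeline):
    - every step re-sends the full context;
    - every step is followed by one validation call that re-sends the
      context plus that step's output;
    - a fraction `retry_rate` of step+validation pairs runs one extra time.
    """
    step = call_cost(pricing, context_tokens, output_tokens_per_step)
    check = call_cost(
        pricing, context_tokens + output_tokens_per_step, validation_output_tokens
    )
    return steps * (1.0 + retry_rate) * (step + check)


if __name__ == "__main__":
    # Placeholder prices: substitute the current rates for your models.
    small = Pricing(input_per_mtok=1.0, output_per_mtok=5.0)
    frontier = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)

    single = call_cost(frontier, input_tokens=8_000, output_tokens=500)
    chained = decomposed_cost(
        small,
        context_tokens=8_000,
        steps=4,
        output_tokens_per_step=300,
        retry_rate=0.25,
    )
    print(f"single frontier call:        ${single:.4f}")
    print(f"decomposed small-model flow: ${chained:.4f}")
    print(f"ratio: {chained / single:.2f}x")
```

With these placeholder numbers the decomposed small-model workflow comes out more expensive than the single frontier call, even though the small model is assumed to be 3x cheaper per token. The outcome depends on the actual price gap, context size, and retry rate, and discounts such as cached input are not modeled, so plug in your own figures before drawing a conclusion.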
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:47:46.762518+00:00— report_created — created