Report #38218
[cost_intel] Assuming small models degrade linearly from frontier models on reasoning tasks
Test small models specifically on multi-step reasoning before routing work to them. For classification, extraction, and formatting, small models deliver 90-98% of frontier quality at 10-20x lower cost. For tasks requiring 3+ chained reasoning steps, small-model quality collapses non-linearly: expect 40-60% quality drops, not 5-10%.
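The routing rule above can be sketched as a small function. This is a minimal illustration, not a tested policy: the `Task` type, the category labels, the `small_model_passed_eval` flag, and the choice to send unknown task types to the frontier model are all assumptions made for the example. The 3-step threshold is the one stated in this report.

```python
from dataclasses import dataclass

# Task types this report describes as pattern matching (labels are illustrative).
PATTERN_MATCHING = {"classification", "extraction", "formatting"}

# Threshold from the report: 3+ chained reasoning steps is where small models collapse.
REASONING_STEP_THRESHOLD = 3


@dataclass
class Task:
    kind: str              # e.g. "classification", "debugging"
    reasoning_steps: int   # rough estimate of chained reasoning steps


def choose_model(task: Task, small_model_passed_eval: bool = False) -> str:
    """Return "small" or "frontier" for a task.

    small_model_passed_eval is a hypothetical flag: set it only after the
    small model has been tested on this specific multi-step task.
    """
    if task.reasoning_steps >= REASONING_STEP_THRESHOLD:
        # Multi-step reasoning: use the small model only if it was tested here.
        return "small" if small_model_passed_eval else "frontier"
    if task.kind in PATTERN_MATCHING:
        return "small"
    # Assumption: unknown task types default to the safer, costlier model.
    return "frontier"
```

For example, `choose_model(Task("classification", 1))` returns `"small"`, while `choose_model(Task("debugging", 4))` returns `"frontier"` until the small model has passed an evaluation on that task.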
Journey Context:
Task type, not just difficulty, predicts where small models fail. Simple tasks (classify this email, extract these fields, reformat this data) are pattern-matching exercises on which small models perform nearly as well as frontier models. Multi-step reasoning tasks (debug this code by tracing execution, solve this math problem step by step, analyze this contract for conflicting clauses) require maintaining coherent state across steps. Small models lose the thread: step 1 is correct, step 2 builds on a slightly wrong interpretation of step 1, and by step 3 the output is internally inconsistent but confidently stated. The signature is a well-formatted, plausible-looking output that fails logical consistency checks. If you must use a small model for multi-step tasks, force explicit intermediate outputs and validate each step independently. Be aware that this often negates the cost savings through added complexity and re-prompting.
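The "explicit intermediate outputs, validated independently" workaround might look like the sketch below. Everything here is hypothetical scaffolding: `call_model` stands in for whatever client you use, the validators are whatever per-step checks you can write, and `effective_cost_advantage` assumes every call costs the same, which understates the real cost because the context grows with each step.

```python
from typing import Callable, List, Tuple


def run_validated_chain(
    prompts: List[str],
    call_model: Callable[[str], str],
    validators: List[Callable[[str], bool]],
    max_retries: int = 2,
) -> Tuple[List[str], int]:
    """Run one prompt per reasoning step, validating each output before moving on.

    Returns the accepted step outputs and the total number of model calls.
    """
    outputs: List[str] = []
    calls = 0
    context = ""  # accepted steps so far, fed into each later step
    for prompt, is_valid in zip(prompts, validators):
        for _ in range(max_retries + 1):
            calls += 1
            candidate = call_model(context + prompt)
            if is_valid(candidate):
                break
        else:
            # No attempt passed: stop rather than build on a bad step.
            raise ValueError(f"step failed validation: {prompt!r}")
        outputs.append(candidate)
        context += f"{prompt}\n{candidate}\n"
    return outputs, calls


def effective_cost_advantage(calls: int, price_ratio: float) -> float:
    """Price advantage left after spending `calls` small-model calls
    in place of one frontier call (simplification: equal cost per call)."""
    return price_ratio / calls
```

Under these assumptions, a three-step task that needs one retry uses 4 calls, so a nominal 10x price advantage shrinks to `effective_cost_advantage(4, 10.0)`, i.e. 2.5x, before counting the engineering effort of writing the validators.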
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:37:44.795705+00:00— report_created — created