Report #47825
[cost_intel] Small models producing 40-60% quality on multi-step reasoning tasks with no clear warning during development
For tasks that require 3+ sequential reasoning steps, where each step depends on the previous one (multi-hop QA, complex data transformation pipelines, multi-constraint planning), use frontier models. Small models degrade non-linearly: they may handle 1-2 steps at 85%+ accuracy but collapse to 40-50% at 3+ steps because errors cascade from one step to the next.
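To get a rough feel for the cascading effect, here is a minimal sketch assuming each step succeeds independently with the same probability. That independence assumption and the 0.85 figure are illustrative only, not measurements from this report. Under this toy model a chain at 85% per step falls to about 61% at three steps and about 44% at five; real pipelines may fall off more sharply, since a wrong intermediate result can make later steps harder.

```python
def chain_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes step errors are independent and equally likely, which is a
    simplification: in practice a wrong step often makes later steps harder.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Illustrative per-step accuracy, not a measured value.
    for n in range(1, 6):
        print(f"{n} step(s): {chain_success_rate(0.85, n):.3f}")
```

Running the same function against your own measured single-step accuracy gives a quick lower-effort estimate of whether a multi-step pipeline is worth testing end to end before deployment.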
Journey Context:
The quality drop isn't gradual; it's a cliff. The signature is compounding errors: step 2 builds on a slightly wrong step 1, step 3 builds on a wrong step 2, and so on. Teams test on simple single-step cases during development, deploy on complex multi-step pipelines in production, and then wonder why quality tanked. GSM8K benchmark results illustrate this: frontier models score 90%+ while small models score 50-70%, and the gap widens as step count increases. The fix isn't always 'use frontier for everything'. Where possible, decompose multi-step tasks into verified single-step subtasks, or use a frontier model for the reasoning chain and small models for the individual operations.
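One way the "verified single-step subtasks" idea might look in code is sketched below. Every name here (`Stage`, `run_chain`, `StageFailed`) is hypothetical, and the `run` and `fallback` callables stand in for small-model and frontier-model calls that you would supply yourself. The point is only that each step's output is checked before the next step consumes it, so a bad step 1 cannot silently poison step 2.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Stage:
    """One single-step subtask (hypothetical structure for illustration)."""
    name: str
    run: Callable[[Any], Any]            # assumed: wraps a small-model call
    verify: Callable[[Any, Any], bool]   # assumed: cheap check of (input, output)
    fallback: Optional[Callable[[Any], Any]] = None  # assumed: wraps a frontier-model call


class StageFailed(Exception):
    """Raised when a stage's output cannot be verified, even after fallback."""


def run_chain(value: Any, stages: List[Stage]) -> Any:
    """Run stages in order, verifying each output before passing it on."""
    for stage in stages:
        result = stage.run(value)
        if not stage.verify(value, result):
            # The cheap path produced an unverifiable result: escalate or stop.
            if stage.fallback is None:
                raise StageFailed(stage.name)
            result = stage.fallback(value)
            if not stage.verify(value, result):
                raise StageFailed(stage.name)
        value = result
    return value


# Toy usage: plain functions stand in for model calls.
demo_stages = [
    Stage("parse", run=lambda s: int(s), verify=lambda i, o: isinstance(o, int)),
    # The "small" path is deliberately wrong here so the fallback is exercised.
    Stage("double", run=lambda x: x + 2, verify=lambda i, o: o == 2 * i,
          fallback=lambda x: x * 2),
]
demo_result = run_chain("21", demo_stages)
```

Failing loudly with `StageFailed` is a deliberate choice in this sketch: an explicit error during development is the "clear warning" that an unverified multi-step chain does not give you. Whether a cheap `verify` function exists at all depends on the task; where it does not, the frontier-for-the-chain option from the paragraph above may be the better fit.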
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:45:44.812979+00:00— report_created — created