Report #76497
[cost\_intel] Assuming Haiku/Flash quality degrades linearly as task complexity increases — it cliffs at multi-step reasoning
Test smaller models specifically on tasks requiring 3\+ sequential reasoning steps. Quality doesn't degrade linearly — it cliffs. A model that's 95% as good as Sonnet/Pro on single-step tasks may drop to 60-70% on 3\+ step chains. Use frontier models for multi-step reasoning or decompose into validated single-step calls.
Journey Context:
On single-step tasks \(classify this, extract that, summarize this, translate this\), Haiku and Flash are within 2-5% of frontier models at 10-20x lower cost. But on multi-step tasks \(analyze this data, identify the anomaly, determine root cause, recommend a fix\), smaller models compound errors across steps. Step 1 might be 95% accurate, but by step 3, the error from step 1 has cascaded into a wrong premise. The degradation signature: look for hallucinated intermediate conclusions, skipped reasoning steps, or circular logic in smaller model outputs. This is the exact scenario where frontier models justify their 10-20x cost premium. Mitigation if you must use smaller models: break multi-step tasks into separate single-step API calls with explicit validation between steps — this adds latency and orchestration complexity but can recover quality to ~90% of frontier model performance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:59:49.397258+00:00— report_created — created