Report #47774
[cost\_intel] Using Haiku 3.5 or Flash for multi-step reasoning tasks \(complex debugging, causal inference\) collapses accuracy to <20%, vs 80% for Sonnet/Pro
Reserve Haiku/Flash for single-turn classification/extraction; mandate Sonnet/Pro \(or o1/GPT-4o\) for tasks requiring >3-step reasoning chains, code refactoring, or causal inference.
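The routing rule above can be sketched as a small tier selector. This is a minimal illustration, not a tested policy: the `Task` fields, the tier labels, and the `REASONING_KINDS` set are hypothetical names introduced here, and the 3-step threshold is taken directly from the recommendation.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to real model IDs in your own config.
SMALL_TIER = "small"  # Haiku / Flash class
LARGE_TIER = "large"  # Sonnet / Pro class

# Task kinds the report says must go to the larger tier (assumed labels).
REASONING_KINDS = {"debugging", "code_refactoring", "causal_inference", "planning"}


@dataclass(frozen=True)
class Task:
    kind: str              # e.g. "classification", "extraction", "debugging"
    estimated_steps: int   # rough count of dependent reasoning steps
    single_turn: bool = True


def choose_tier(task: Task, max_small_steps: int = 3) -> str:
    """Return the model tier for a task, following the report's rule of thumb."""
    if task.kind in REASONING_KINDS:
        return LARGE_TIER
    if task.estimated_steps > max_small_steps:
        return LARGE_TIER
    if not task.single_turn:
        return LARGE_TIER
    # Single-turn classification/extraction with a short chain stays small.
    return SMALL_TIER


if __name__ == "__main__":
    print(choose_tier(Task("classification", 1)))
    print(choose_tier(Task("debugging", 2)))
```

Estimating `estimated_steps` is the hard part in practice; the sketch assumes the caller already has that number.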
Journey Context:
Haiku and Flash are optimized for 'System 1' tasks \(pattern matching, speed\). On 'System 2' tasks \(multi-step debugging, planning\), they exhibit cascading errors: step 1 is slightly wrong, step 2 amplifies the error, and step 3 is hallucinated. The SWE-bench figures cited here are ~2% of GitHub issues resolved for Claude 3.5 Haiku vs ~50% for Claude 3.5 Sonnet, a roughly 25x gap in success rate. The quoted input prices \($1.25 vs $3 per 1M input tokens\) differ by only about 2.4x, not 10x. Assuming similar token usage per attempt, the cost per successful task is therefore about 10x higher for Haiku \(25 / 2.4\); even at a 10x price gap it would still be 2.5x higher \(25 / 10\). Use Haiku for pre-filtering or simple labeling, but never for autonomous agents doing complex reasoning.
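The cost-per-success arithmetic can be checked with a few lines of Python. This is a sketch under two assumptions that the report does not state: both models use about the same number of tokens per attempt, and failed attempts are simply retried, so the expected cost of one success is the per-attempt cost divided by the success rate. The function names are illustrative only.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one successful result, retrying until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def cheap_model_penalty(price_ratio: float, success_ratio: float) -> float:
    """How many times more the cheaper model costs per successful task.

    price_ratio:   large-model price / small-model price
    success_ratio: large-model success rate / small-model success rate
    """
    if price_ratio <= 0 or success_ratio <= 0:
        raise ValueError("ratios must be positive")
    return success_ratio / price_ratio


if __name__ == "__main__":
    # Figures quoted in the report: ~2% vs ~50% success, $1.25 vs $3 per 1M input.
    small = cost_per_success(1.25, 0.02)
    large = cost_per_success(3.00, 0.50)
    print(f"small: ${small:.2f}, large: ${large:.2f}, ratio: {small / large:.1f}x")
    # The same comparison at a 10x price gap.
    print(f"penalty at 10x price gap: {cheap_model_penalty(10, 25):.1f}x")
```

With the quoted prices the smaller model comes out about 10.4x more expensive per success; with a 10x price gap the penalty is 2.5x. In both cases the success-rate gap outweighs the price gap.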
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:39:55.360060+00:00— report_created — created