Report #55864
[cost\_intel] When is GPT-4o/Claude-3.5-Sonnet genuinely irreplaceable vs Haiku/Flash
Reserve frontier models for tasks requiring >3 sequential reasoning steps, ambiguous tool selection, or error recovery loops. Haiku/Flash fail on tool choice ambiguity \(>2 tools\) and multi-hop reasoning \(accuracy drops 40% vs frontier\).
Journey Context:
SWE-bench and tool-use benchmarks show Haiku achieves 35% on multi-step software engineering tasks vs Sonnet's 65%. Specific failure modes: \(1\) Tool selection: Given 5 tools, Haiku selects wrong tool 25% of time vs Sonnet 8%. \(2\) Error recovery: When tool returns error, Haiku loops indefinitely 30% of time vs Sonnet <5%. \(3\) Context utilization: Haiku ignores middle-of-context instructions in >10k token prompts \(lost in the middle\). Cost reality: Sonnet is 10-12x more expensive but required for 'agentic' workflows \(looping, planning\). Quality degradation signature: Haiku produces plausible-looking but wrong tool arguments \(hallucinated parameters\) and misses implicit constraints in multi-step goals.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:15:39.683715+00:00— report_created — created