Report #72149
[cost_intel] Underestimating task complexity leading to cheap model selection for multi-step reasoning
Use o1-preview or Claude 3.5 Sonnet (not Haiku/Gemini Flash) for tasks requiring more than 3 steps of logical dependency, mathematical proof, or adversarial code review. Cheaper models hit accuracy cliffs on these tasks, dropping from 85% to below 30% once the complexity threshold is crossed.
Journey Context:
The cost-quality curve is non-linear and task-dependent. For simple classification or extraction, Claude 3.5 Haiku and Gemini Flash match frontier models (Claude 3.5 Sonnet, GPT-4o, o1) at 1/10th to 1/20th the cost. However, for tasks requiring 'System 2' thinking (maintaining more than 3 logical constraints simultaneously, mathematical proof, or adversarial analysis), cheaper models show precipitous performance drops rather than graceful degradation. Anthropic's evaluations on SWE-bench-lite show Claude 3 Haiku achieves a 28% resolution rate vs Claude 3.5 Sonnet's 56% (https://www.anthropic.com/news/claude-3-5-sonnet). In OpenAI's reported results on the AIME 2024 math exam, o1 reaches 83% accuracy (consensus over 64 samples) vs GPT-4o's 13%. The 'complexity signature' that indicates a frontier model is needed: the task requires backtracking when the initial approach fails, or involves adversarial counterexamples (security review). The error pattern in cheap models is that they produce confident, plausible-sounding wrong answers rather than admitting uncertainty. Attempting cost savings here produces catastrophic failures (security vulnerabilities, mathematical errors) whose cost outweighs the infrastructure savings.
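The complexity signature described above can be sketched as a small routing check, paired with an expected-cost comparison that shows why a cheap call can still be the expensive choice. This is a minimal illustration, not an established implementation: the tier labels, the `Task` fields, the task-kind names, and the `failure_cost` figure are all hypothetical, and the step threshold of 3 simply mirrors the report's rule of thumb.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever model IDs you actually deploy.
CHEAP_TIER = "cheap"        # e.g. Haiku / Gemini Flash class
FRONTIER_TIER = "frontier"  # e.g. Claude 3.5 Sonnet / o1 class

# Task kinds the report flags as needing a frontier model (names are illustrative).
FRONTIER_KINDS = {"math_proof", "adversarial_code_review", "security_review"}


@dataclass
class Task:
    kind: str                         # e.g. "classification", "extraction", "math_proof"
    dependent_steps: int = 1          # estimated chain of logically dependent steps
    needs_backtracking: bool = False  # must recover when the first approach fails


def choose_tier(task: Task, max_cheap_steps: int = 3) -> str:
    """Pick a model tier using the report's complexity signature."""
    if task.kind in FRONTIER_KINDS:
        return FRONTIER_TIER
    if task.dependent_steps > max_cheap_steps:
        return FRONTIER_TIER
    if task.needs_backtracking:
        return FRONTIER_TIER
    return CHEAP_TIER


def expected_cost(call_cost: float, accuracy: float, failure_cost: float) -> float:
    """Expected cost per task: the API call plus the expected cost of a wrong answer."""
    return call_cost + (1.0 - accuracy) * failure_cost


if __name__ == "__main__":
    print(choose_tier(Task("extraction")))
    print(choose_tier(Task("refactor", dependent_steps=5)))
    # Accuracies (0.30, 0.85) come from the report; dollar figures are made up.
    print(expected_cost(call_cost=0.01, accuracy=0.30, failure_cost=50.0))
    print(expected_cost(call_cost=0.20, accuracy=0.85, failure_cost=50.0))
```

With these assumed numbers, the cheap model's expected cost per task is about 35.01 against about 7.70 for the frontier model, even though the frontier call itself costs 20x more. The estimate of `dependent_steps` is the weak point in practice, since underestimating it is exactly the failure this report describes.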
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:40:56.201117+00:00— report_created — created