Report #39753
[cost\_intel] Using small models \(Haiku, GPT-3.5\) for tasks requiring multi-step logical deduction with >3 variables or counterfactual reasoning
Reserve larger models \(Claude 3.5 Sonnet, Claude 3 Opus, GPT-4o, o1\) for tasks that require: \(1\) abductive reasoning over >10k tokens of context, \(2\) detection of logical contradictions in multi-step arguments, or \(3\) generation of novel algorithmic approaches. Cheaper models exhibit cascading errors on transitive reasoning chains once the context holds more distractors than they can reliably track.
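The criteria above could be encoded as a simple routing check. The following is a minimal sketch, not a definitive implementation: `TaskProfile`, `choose_tier`, and the tier labels are hypothetical names introduced here for illustration, and the thresholds are copied from this report rather than independently validated.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever model IDs your provider exposes.
SMALL_TIER = "small"
LARGE_TIER = "large"


@dataclass
class TaskProfile:
    """Hypothetical description of a task, filled in by the caller."""
    context_tokens: int                  # approximate prompt size in tokens
    abductive: bool = False              # must infer the best explanation from evidence
    checks_contradictions: bool = False  # must find contradictions in a multi-step argument
    novel_algorithm: bool = False        # must design a new algorithmic approach
    deduction_variables: int = 0         # variables involved in multi-step deduction
    counterfactual: bool = False         # requires counterfactual reasoning


def choose_tier(task: TaskProfile) -> str:
    """Return LARGE_TIER when any criterion from the report applies."""
    # (1) abductive reasoning over more than 10k tokens of context
    if task.abductive and task.context_tokens > 10_000:
        return LARGE_TIER
    # (2) contradiction detection, (3) novel algorithm generation
    if task.checks_contradictions or task.novel_algorithm:
        return LARGE_TIER
    # Report title: multi-step deduction with >3 variables, or counterfactuals
    if task.deduction_variables > 3 or task.counterfactual:
        return LARGE_TIER
    return SMALL_TIER
```

A caller would build a `TaskProfile` per request and send the prompt to the model mapped to the returned tier. The flags are set by the caller, so the routing is only as good as that upstream classification.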
Journey Context:
Cost optimization often pushes teams to downgrade reasoning tasks to cheaper models. However, benchmarks on logical deduction \(e.g., LSAT logic games, code correctness proofs\) show a sharp cliff: Haiku/GPT-3.5 maintain accuracy on single-step retrieval but fail on transitive reasoning \(A>B, B>C, therefore A>C\) when the context contains more than 5 distractor items. The characteristic failure signature is 'confabulated intermediate steps': steps that look plausible but contain subtle logical inversions.
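One way to check whether a cheaper model hits this cliff on your own workload is to generate transitive-chain probes with a known answer and a controllable number of distractors. The sketch below only builds the prompt and the expected answer; it does not call any model API. `make_transitive_probe` and its parameters are hypothetical names chosen for this example.

```python
import random


def make_transitive_probe(chain_len: int, n_distractors: int, seed: int = 0):
    """Build a transitive-ordering prompt with a known yes/no answer.

    chain_len     -- number of variables in the true chain (A > B > C ...)
    n_distractors -- number of unrelated "X > Y" facts mixed into the context
    Returns (prompt, expected_answer).
    """
    if chain_len < 2:
        raise ValueError("chain_len must be at least 2")
    rng = random.Random(seed)

    # Distinct names so that distractor facts never touch the chain variables.
    names = [f"V{i}" for i in range(chain_len + 2 * n_distractors)]
    rng.shuffle(names)
    chain, rest = names[:chain_len], names[chain_len:]

    # Adjacent facts along the chain: chain[0] > chain[1] > ... > chain[-1]
    facts = [f"{a} > {b}" for a, b in zip(chain, chain[1:])]
    # Distractor facts over disjoint pairs of unrelated variables
    facts += [f"{rest[2 * i]} > {rest[2 * i + 1]}" for i in range(n_distractors)]
    rng.shuffle(facts)

    # Randomize the question direction so "yes" is not always the right answer.
    if rng.random() < 0.5:
        question, expected = f"Is {chain[0]} > {chain[-1]}?", "yes"
    else:
        question, expected = f"Is {chain[-1]} > {chain[0]}?", "no"

    prompt = "Facts:\n" + "\n".join(facts) + "\n" + question + " Answer yes or no."
    return prompt, expected
```

Sweeping `n_distractors` across the 5-item threshold mentioned above and comparing each model's answers against `expected` gives a rough, workload-specific view of where accuracy drops.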
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:11:51.756914+00:00— report_created — created