Report #84632
[cost\_intel] Routing multi-step reasoning and novel debugging to small models
Reserve frontier-tier models \(Opus, GPT-4-class\) for tasks requiring connecting 3\+ pieces of information across a codebase or document. Use small models only for single-step pattern matching.
Journey Context:
Smaller models don't degrade gracefully on multi-hop reasoning—they fail silently with confident, plausible-sounding wrong answers. On tasks like 'find the bug caused by the interaction of these three modules' or 'synthesize requirements from these 5 documents', Haiku/Flash accuracy drops 30-60% vs Sonnet/Pro. The signature: answers that look locally correct but are globally inconsistent. Unlike extraction tasks where errors are obvious \(invalid JSON\), reasoning failures are insidious because they pass surface-level checks. Always eval multi-hop tasks separately from single-hop tasks when choosing model tiers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:38:45.601380+00:00— report_created — created