Report #84868
[cost_intel] Complex multi-step reasoning with small models yields confident but wrong outputs
Use a frontier-class model (Sonnet/Pro/GPT-4) for any task requiring 3+ reasoning steps, tool-use chains, or connecting information across different parts of a context. On these tasks small models produce plausible but incorrect outputs, which are worse than obviously wrong ones.
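The rule above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: `TaskProfile`, `pick_tier`, and the tier labels are hypothetical names, and the step and source counts are estimates the caller has to supply.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever models you actually deploy.
SMALL_TIER = "small"
FRONTIER_TIER = "frontier"


@dataclass
class TaskProfile:
    """Caller-supplied estimate of how demanding a task is (assumed fields)."""
    reasoning_steps: int           # dependent reasoning steps the task needs
    distinct_sources: int          # separate modules or document sections to combine
    uses_tool_chain: bool = False  # True if one tool call's output feeds another


def pick_tier(task: TaskProfile) -> str:
    """Route to the frontier tier when any condition from the report holds."""
    if task.reasoning_steps >= 3:   # "3+ reasoning steps"
        return FRONTIER_TIER
    if task.uses_tool_chain:        # tool-use chains
        return FRONTIER_TIER
    if task.distinct_sources > 1:   # connecting information across sources
        return FRONTIER_TIER
    return SMALL_TIER               # single-step: summarize, classify, extract


if __name__ == "__main__":
    print(pick_tier(TaskProfile(reasoning_steps=1, distinct_sources=1)))
    print(pick_tier(TaskProfile(reasoning_steps=4, distinct_sources=3)))
```

The checks are deliberately biased toward the frontier tier: a misrouted simple task only costs extra API spend, while a misrouted multi-step task risks a confidently wrong answer.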
Journey Context:
On multi-hop reasoning tasks (e.g., finding a bug caused by the interaction of three modules, or answering a question that requires synthesizing information from multiple document sections), small models do not just score lower: they produce confidently wrong answers that pass surface-level review. The quality degradation signature is output that looks syntactically correct and plausible but contains subtle logical errors. This is worse than an obvious error, because it passes code review and creates latent defects. On cost, Sonnet at $3/M tokens vs Haiku at $0.25/M tokens is a 12x difference, but the rework cost of a confidently wrong architectural decision or a subtle logic bug far exceeds the API savings. A reliable heuristic: if the task requires maintaining coherent state across more than 2 reasoning steps, or combining information from multiple distinct sources, use a frontier model. Single-step tasks (summarize, classify, extract) are safe for small models.
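The cost argument above can be made concrete with a short break-even calculation. This is a sketch under stated assumptions: the per-million-token prices are the ones quoted in the report, while the token count and the rework cost in the usage note are hypothetical placeholders, and the function names are illustrative only.

```python
# Prices per million tokens, as quoted in the report.
SONNET_PER_M = 3.00
HAIKU_PER_M = 0.25


def api_cost(tokens: int, price_per_m: float) -> float:
    """API spend in dollars for one call of the given size."""
    return tokens / 1_000_000 * price_per_m


def expected_cost(tokens: int, price_per_m: float,
                  p_silent_error: float, rework_cost: float) -> float:
    """API spend plus the expected cost of an undetected wrong answer."""
    return api_cost(tokens, price_per_m) + p_silent_error * rework_cost


def break_even_error_gap(tokens: int, rework_cost: float) -> float:
    """Extra silent-error rate at which the small model stops being cheaper.

    If the small model's silent-error rate exceeds the frontier model's by
    more than this value, the API saving is wiped out by expected rework.
    """
    saving = api_cost(tokens, SONNET_PER_M) - api_cost(tokens, HAIKU_PER_M)
    return saving / rework_cost


if __name__ == "__main__":
    # Hypothetical inputs: a 20,000-token call and $100 of rework per defect.
    print(break_even_error_gap(tokens=20_000, rework_cost=100.0))
```

With those hypothetical inputs the saving is about $0.055 per call, so the small model only comes out ahead if its additional silent-error rate stays below roughly 0.055%. The real figures depend on your own token volumes and on what a missed defect costs you.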
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:02:12.338825+00:00 - report_created - created