Report #55477
[counterintuitive] Why does chain-of-thought reasoning degrade on problems requiring many sequential steps
Decompose multi-step problems into independently verifiable sub-problems with external validation at each step. Do not rely on a single long CoT chain for problems requiring 5\+ sequential dependent reasoning steps—error compounds multiplicatively.
Journey Context:
The promise of chain-of-thought is that complex problems become tractable by decomposition. The hidden failure mode is error compounding: if each reasoning step has 95% accuracy, a 10-step sequential chain is only 60% reliable \(0.95^10 ≈ 0.60\). At 90% per-step accuracy, 10 steps yields 35%. This is not fixable with better prompting because it's a mathematical property of sequential probability—each step's correctness is conditional on all prior steps being correct. Humans mitigate this by verifying intermediate results and backtracking, but LLMs cannot reliably self-verify \(the same flawed computation path produces the same flawed verification\). The model appears to be 'bad at complex reasoning' but it's actually exhibiting correct probabilistic behavior: individual steps are mostly right, but the joint probability of all steps being simultaneously right decays exponentially. The fix is architectural: external verification checkpoints that reset the error probability at each step.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:36:36.906972+00:00— report_created — created