Report #22214
[counterintuitive] Adding chain-of-thought prompting always improves reasoning accuracy
Apply CoT selectively. Use CoT for tasks requiring genuine multi-step reasoning (math, logic, planning). Skip it for classification, factual recall, and other tasks where the model already has strong parametric knowledge. Always benchmark with and without CoT per task type: if zero-shot accuracy is already high, CoT may hurt.
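The per-task benchmark recommended above can be sketched roughly as follows. This is a minimal illustration, not a definitive harness: `model` is a hypothetical callable that maps a prompt string to a completion string, the helper names and the `min_gain` threshold are assumptions, and taking the last output line as the final answer is a simplification. The trigger phrase is the one used in Kojima et al. (2022).

```python
from typing import Callable, Sequence, Tuple

# Zero-shot CoT trigger phrase from Kojima et al. (2022).
COT_SUFFIX = "\nLet's think step by step."


def accuracy(
    model: Callable[[str], str],          # hypothetical: prompt -> completion
    examples: Sequence[Tuple[str, str]],  # (question, expected answer) pairs
    use_cot: bool,
) -> float:
    """Fraction of examples answered correctly, with or without the CoT suffix."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for question, expected in examples:
        prompt = question + (COT_SUFFIX if use_cot else "")
        lines = model(prompt).strip().splitlines()
        # Simplifying assumption: the final answer is the last line of output.
        answer = lines[-1].strip() if lines else ""
        correct += answer == expected
    return correct / len(examples)


def should_use_cot(
    model: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    min_gain: float = 0.02,  # assumed threshold; tune per task and budget
) -> bool:
    """Enable CoT for a task type only if it beats direct prompting by min_gain."""
    direct = accuracy(model, examples, use_cot=False)
    with_cot = accuracy(model, examples, use_cot=True)
    return with_cot - direct >= min_gain
```

Running `should_use_cot` once per task type, on a held-out sample for that type, gives a per-task decision instead of a global default.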
Journey Context:
CoT has well-documented failure modes. The original zero-shot CoT paper (Kojima et al., 2022) showed dramatic improvements on multi-step reasoning but little or no improvement, and in some cases slight degradation, on tasks that don't require step-by-step reasoning. Forcing a model to reason step by step when it already 'knows' the answer introduces an opportunity for error at each step, and one wrong step can cascade into a wrong final answer. CoT can also make models more susceptible to irrelevant context, because distracting information can hijack the reasoning chain. Additionally, CoT increases latency and cost, by roughly 3–10x, since the model generates many more output tokens. The correct approach is empirical: benchmark per task rather than enabling CoT by default. For coding agents, CoT helps with debugging and architecture decisions but can hurt on straightforward API lookups or syntax questions.
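The error-cascade argument can be made concrete with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that each reasoning step is independently correct with the same probability and that any single wrong step produces a wrong final answer. Real models can sometimes recover from a bad step, so treat this as a rough intuition, not a measurement. The function name is hypothetical.

```python
def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """End-to-end accuracy of a reasoning chain under the independence assumption.

    Assumes every step must be correct and steps fail independently,
    so the chain succeeds with probability per_step_accuracy ** steps.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Show how quickly accuracy decays as the chain gets longer.
    for n in (1, 3, 5, 10):
        print(f"{n:>2} steps at 95% per step: {chain_accuracy(0.95, n):.3f}")
```

Under these assumptions, five steps at 95% per-step accuracy give roughly 77% end-to-end accuracy. This is why a task the model already answers directly with high accuracy can get worse once it is forced through a multi-step chain.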
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:41:58.045477+00:00— report_created — created