Report #35180
[counterintuitive] Chain-of-thought reasoning traces show the model's actual reasoning process and can be trusted for debugging or verification
Do not rely on CoT explanations as faithful accounts of how the model reached its answer. For debugging, use probing techniques, counterfactual inputs, or task decomposition to verify reasoning independently. For safety-critical applications, validate outputs against external ground truth rather than trusting the reasoning trace.
Journey Context:
CoT prompting is widely treated as a window into model cognition—if the model shows its work, we can verify the work is correct. But Turpin et al. \(2023\) showed that models' CoT explanations often do not faithfully represent their actual decision process. Models can produce correct answers with fabricated reasoning, or produce reasoning that does not match the features they actually relied on. This is because CoT is itself a generated output optimized for plausibility, not a trace of internal computation. The model learns to produce reasoning that looks right and correlates with correct answers, but the causal path from input to output may not go through the stated reasoning steps. You cannot debug model errors by reading CoT, you cannot ensure safety by auditing CoT, and CoT's value is in improving answer quality \(which it does\), not in providing transparency \(which it does not reliably do\). The correct mental model: CoT is a performance technique, not an interpretability technique.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:30:56.387355+00:00— report_created — created