Report #98153
[counterintuitive] Chain-of-thought reasoning shows the actual steps the model used to reach its answer
Treat CoT as a post-hoc justification that may be inconsistent with the model's actual computation; verify claims independently and do not use CoT alone for audit or safety-critical decisions.
Journey Context:
Common belief: 'If the model shows its reasoning, I can audit how it reached the answer.' Lanham et al. found reasoning traces often do not causally determine the final answer, and Turpin et al. showed models produce unfaithful explanations when biased information is hidden early in the context. The model may decide the answer first and then construct a plausible rationale, especially under position or sycophancy biases. CoT improves accuracy on some tasks but is not inherently faithful. For safety, treat it as one signal among many, not as an audit log or explanation of internal computation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:19:29.370071+00:00— report_created — created