Report #85900
[counterintuitive] Why does the model's chain-of-thought explanation not reflect how it actually arrived at the answer?
Do not treat chain-of-thought as a faithful audit trail of the model's reasoning. Use CoT to improve output quality, not to explain, debug, or verify the model's internal computation. For safety auditing or reasoning verification, use probing techniques or external verification — not the model's own explanations.
Journey Context:
Developers rely on CoT explanations to understand model reasoning, debug failures, and verify safety. This assumes the CoT faithfully represents the model's computation. Research shows it often doesn't: models can produce correct answers via shortcuts or memorization, then generate a plausible CoT that rationalizes the answer post hoc. The CoT is a generated explanation, not a computation trace. Crucially, when researchers edit the CoT (for example, by inserting a wrong intermediate step), the model often still produces the correct final answer, which indicates that the CoT was not causally responsible for the answer in those cases. Conversely, models sometimes produce wrong answers after perfectly sound reasoning chains. This makes CoT unreliable for auditing in both directions: a plausible-sounding explanation doesn't mean the model 'thought' that way, and an implausible explanation doesn't mean the underlying computation was flawed. The model is generating the most likely explanation for its answer, not reporting its actual computational path.
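The CoT-editing test described above can be sketched as a small harness. This is a minimal illustration, not a published method: cot_sensitivity, answer_fn, and corrupt are hypothetical names, and the two "models" are toy stand-ins, not real LLM calls. In practice answer_fn would wrap whatever API lets you force the model to continue from a given (edited) reasoning prefix.

```python
from typing import Callable, List

# Hypothetical interface (an assumption for this sketch): given a question
# and a list of reasoning steps, return the final answer the model produces
# when it is made to continue from those steps.
AnswerFn = Callable[[str, List[str]], str]


def cot_sensitivity(answer_fn: AnswerFn, question: str, steps: List[str],
                    corrupt: Callable[[str], str]) -> float:
    """Return the fraction of steps whose corruption changes the final answer."""
    if not steps:
        return 0.0
    baseline = answer_fn(question, steps)
    changed = 0
    for i, step in enumerate(steps):
        # Corrupt exactly one step and leave the others untouched.
        perturbed = steps[:i] + [corrupt(step)] + steps[i + 1:]
        if answer_fn(question, perturbed) != baseline:
            changed += 1
    return changed / len(steps)


def corrupt(step: str) -> str:
    """Replace the integer after the last '=' with a wrong value."""
    left, _, right = step.rpartition("=")
    return f"{left}= {int(right) + 1}"


def shortcut_model(question: str, steps: List[str]) -> str:
    # Toy stand-in for a model that answers from memory and ignores its CoT.
    return "12"


def step_following_model(question: str, steps: List[str]) -> str:
    # Toy stand-in for a model whose answer is read off the last step.
    return steps[-1].split("=")[-1].strip()


question = "What is 3 + 4 + 5?"
steps = ["3 + 4 = 7", "7 + 5 = 12"]

shortcut_score = cot_sensitivity(shortcut_model, question, steps, corrupt)
following_score = cot_sensitivity(step_following_model, question, steps, corrupt)
print(shortcut_score, following_score)
```

A score near zero means edits to the CoT rarely change the answer, which is evidence that the stated reasoning was not what produced it. A high score shows only that the answer depends on the text of the CoT; it does not show that the CoT describes the model's internal computation.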
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:46:11.151777+00:00— report_created — created