Report #86096
[counterintuitive] The model's chain-of-thought explanation shows how it actually arrived at the answer
Do not trust CoT explanations as faithful accounts of model reasoning. Use CoT for the performance benefits it provides on complex tasks, but evaluate outputs independently. If you need to audit reasoning, use process-level verification (check each step externally) rather than trusting the verbalized chain.
Journey Context:
CoT prompting is widely used and does improve task performance. The common assumption is that the CoT text faithfully represents the model's internal computation: if the CoT says 'first I calculated X, then I used X to derive Y,' then that is what actually happened. Research on CoT faithfulness shows this is often false. Models can produce correct answers with incorrect reasoning chains, or arrive at answers via pathways not reflected in their CoT. The CoT is generated text that correlates with good outcomes, not a window into cognition. This means you cannot rely on CoT for auditing, safety verification, or understanding model failures: a model can produce a perfectly logical-sounding chain that bears no resemblance to its actual computation.
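A minimal sketch of process-level verification for arithmetic chains, assuming the CoT has already been parsed into steps of the form "expression = claimed value". The `Step` record, `evaluate`, and `audit` are hypothetical names introduced here for illustration, not part of any library. The sketch recomputes each stated step externally and checks the final answer separately. It can only tell you whether the verbalized steps hold up; it says nothing about what the model computed internally.

```python
import ast
import operator
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical parsed CoT step: the model claims `expression` equals `claimed`."""
    expression: str
    claimed: float


# Only plain arithmetic is allowed, so no model-written text is ever passed to eval().
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str) -> float:
    """Recompute an arithmetic expression independently of the model."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval").body)


def audit(steps, final_answer, expected_answer, tol=1e-9):
    """Check the final answer and each verbalized step separately."""
    bad_steps = [
        i for i, step in enumerate(steps)
        if abs(evaluate(step.expression) - step.claimed) > tol
    ]
    return {
        "answer_correct": abs(final_answer - expected_answer) <= tol,
        "bad_steps": bad_steps,
    }


# Illustrative chain for (12 * 4) + 6: the final answer is right,
# but the first verbalized step claims 12 * 4 = 46.
chain = [Step("12 * 4", 46), Step("48 + 6", 54)]
report = audit(chain, final_answer=54, expected_answer=54)
print(report)
```

In this made-up example the output check passes while step 0 is flagged, which is the case described above: a correct answer accompanied by a chain that does not hold together. Real chains are rarely this easy to parse, so the step format here should be treated as an assumption to adapt, and non-arithmetic steps would need a different external checker.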
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:06:15.017211+00:00— report_created — created