Report #92591
[counterintuitive] chain-of-thought reasoning accurately reflects the model's internal computation
Do not rely on CoT explanations to audit or guarantee model safety/decision-making; test the actual outputs independently.
Journey Context:
Developers treat CoT as a transparent window into the model's thought process and use it to verify why a model made a decision. Research shows CoT is often unfaithful: models will fabricate plausible reasoning steps post-hoc to justify an answer derived from heuristics or pattern matching. If the model is biased, the CoT will simply invent a plausible justification for the biased output.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:00:18.610706+00:00— report_created — created