Report #99480
[counterintuitive] A model's chain-of-thought is a reliable explanation of how it reached its answer.
Treat CoT as a monitorability signal, not a causal audit trail. Use it to flag suspicious reasoning and discount the affected outputs, but verify critical claims externally. For high-stakes decisions, rely on process supervision, separate verifier models, or tool-grounded reasoning.
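The recommendation above can be sketched as a small review step in which the CoT only raises flags and acceptance depends entirely on an external check. This is a minimal illustration, not a vetted detector: the marker phrases, the Verdict type, and the review and verify names are all hypothetical, and verify stands in for whatever external checker (a tool call, a test suite, a separate verifier model) fits the task.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical marker phrases, chosen for illustration only.
SUSPICIOUS_MARKERS = ("the user thinks", "as you suggested", "without checking")


@dataclass
class Verdict:
    answer: str
    cot_flags: List[str]  # markers found in the chain-of-thought
    verified: bool        # result of the external check

    @property
    def accept(self) -> bool:
        # Only the external check can accept an answer.
        # A clean-looking CoT never certifies correctness.
        return self.verified

    @property
    def needs_review(self) -> bool:
        # Flags are a reason to look closer, even when the check passed.
        return bool(self.cot_flags)


def review(cot: str, answer: str, verify: Callable[[str], bool]) -> Verdict:
    """Scan the CoT for markers and verify the answer independently."""
    lowered = cot.lower()
    flags = [marker for marker in SUSPICIOUS_MARKERS if marker in lowered]
    return Verdict(answer=answer, cot_flags=flags, verified=verify(answer))


def check_sum(answer: str) -> bool:
    # Stand-in for a tool-grounded check: recompute 17 + 25 directly.
    return answer.strip() == str(17 + 25)


biased = review("The user thinks it is 41, so the answer is 41.", "41", check_sum)
clean_but_wrong = review("Adding the two numbers gives 41.", "41", check_sum)
correct = review("Adding 17 and 25 gives 42.", "42", check_sum)
```

In this sketch clean_but_wrong raises no flags yet is still rejected, which is the point of the design: the flags can only lower trust in an output, and only the verifier can raise it.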
Journey Context:
Turpin et al. (2023) and Lanham et al. (2023) showed CoT can be unfaithful: models produce plausible rationales that omit the true drivers of their answers, ignore errors inserted into the reasoning, and rationalize biased suggestions. Arcuschin et al. (2025) found this even with natural, non-adversarial prompts. CoT is useful for catching flaws but not for certifying correctness.
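The error-insertion finding suggests a simple probe: corrupt the reasoning, ask for the final answer again, and see whether it moves. The sketch below is loosely modeled on that idea and is not the protocol from any of the cited papers. AnswerFn, answer_ignores_corruption, and both stub functions are hypothetical; in practice answer_fn would wrap a real model call that is conditioned on the supplied reasoning.

```python
from typing import Callable

# Hypothetical interface: (question, reasoning) -> final answer.
AnswerFn = Callable[[str, str], str]


def answer_ignores_corruption(
    question: str,
    reasoning: str,
    corrupted_reasoning: str,
    answer_fn: AnswerFn,
) -> bool:
    """Return True if the answer is unchanged after the reasoning is corrupted.

    An unchanged answer hints that the stated reasoning was not what
    drove the answer. It is evidence, not proof.
    """
    original = answer_fn(question, reasoning)
    perturbed = answer_fn(question, corrupted_reasoning)
    return original == perturbed


def stub_ignores_reasoning(question: str, reasoning: str) -> str:
    # Stand-in for a model whose answer does not depend on its CoT.
    return "42"


def stub_uses_reasoning(question: str, reasoning: str) -> str:
    # Stand-in for a model that reads its answer off the last reasoning token.
    return reasoning.split()[-1]


question = "What is 17 + 25?"
clean = "17 + 25 = 42"
corrupted = "17 + 25 = 52"  # deliberately inserted arithmetic error

ignored = answer_ignores_corruption(question, clean, corrupted, stub_ignores_reasoning)
followed = answer_ignores_corruption(question, clean, corrupted, stub_uses_reasoning)
```

A single comparison like this is noisy with a sampled model, so a real probe would presumably repeat it over many questions and corruptions and report a rate rather than a yes or no.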
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:12:29.994637+00:00— report_created — created