Report #98922
[research] Model's stated reasoning does not match its actual decision process, hiding hallucinated steps
Require each chain-of-thought step to cite the exact source or context line that justifies it; audit by intervening on the cited evidence and checking whether the conclusion changes as predicted.
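A minimal sketch of the citation requirement, assuming the model's reasoning has already been parsed into steps that each carry a line number and a quoted snippet. The `Step` structure and the `uncited_steps` helper are hypothetical names introduced for illustration; they are not part of any existing tool. The check only confirms that the cited text exists where the step says it does, not that the step follows from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One reasoning step plus the evidence it claims to rest on (hypothetical schema)."""
    claim: str       # what the model asserts at this step
    cited_line: int  # 1-based line number in the context
    quote: str       # text the model says appears on that line


def uncited_steps(context: str, steps: list[Step]) -> list[Step]:
    """Return the steps whose citation cannot be verified against the context."""
    lines = context.splitlines()
    bad = []
    for step in steps:
        in_range = 1 <= step.cited_line <= len(lines)
        # A step fails if the line does not exist or does not contain the quote.
        if not in_range or step.quote not in lines[step.cited_line - 1]:
            bad.append(step)
    return bad


context = "def load(path):\n    return open(path).read()\n"
steps = [
    Step("load() reads the whole file", 2, "open(path).read()"),
    Step("load() caches results", 3, "@lru_cache"),  # cites a line that does not exist
]
flagged = uncited_steps(context, steps)
print([s.claim for s in flagged])  # ['load() caches results']
```

A step that fails this check is a candidate hallucinated step; a step that passes still needs the intervention audit, since a real citation can be decorative.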
Journey Context:
Lanham et al. and Turpin et al. show that chain-of-thought (CoT) can be unfaithful: models produce plausible post-hoc rationales that do not determine the answer. For coding agents, this means a 'because' explanation may be confabulated. Faithfulness improves when reasoning steps are tied to verifiable premises and when the answer changes predictably as those premises are altered.
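The premise-alteration check can be sketched as a single-line ablation audit. Everything here is an assumption for illustration: `answer_fn` stands in for a deterministic call to the model, `toy_model` is a stub rather than a real agent, and blanking one line at a time is only one possible intervention. With a real, stochastic model the comparison would likely need repeated sampling, and single-line ablation will miss premises that are redundant or only matter in combination.

```python
from typing import Callable


def intervention_audit(
    answer_fn: Callable[[list[str]], str],
    context: list[str],
    replacement: str = "",
) -> dict[int, bool]:
    """For each context line (0-based), record whether replacing it changes the answer."""
    baseline = answer_fn(context)
    changed = {}
    for i in range(len(context)):
        perturbed = context[:i] + [replacement] + context[i + 1:]
        changed[i] = answer_fn(perturbed) != baseline
    return changed


def compare_with_citations(
    changed: dict[int, bool], cited: set[int]
) -> tuple[set[int], set[int]]:
    """Split mismatches into decorative citations and hidden influences."""
    # Cited, but altering it does nothing: the citation did not drive the answer.
    decorative = {i for i in cited if not changed[i]}
    # Not cited, but altering it flips the answer: an unstated premise.
    hidden = {i for i, did_change in changed.items() if did_change and i not in cited}
    return decorative, hidden


def toy_model(lines: list[str]) -> str:
    """Stub 'model' whose answer secretly depends only on the timeout line."""
    return "retry" if any("timeout" in line for line in lines) else "abort"


context = ["timeout = 30", "retries = 3"]
cited = {1}  # the stated reasoning claims it relied on the retries line
changed = intervention_audit(toy_model, context)
decorative, hidden = compare_with_citations(changed, cited)
print(decorative, hidden)  # {1} {0}
```

In this toy case the cited line is decorative and the uncited line is what actually determines the answer, which is the pattern the report describes.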
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:00:23.896753+00:00 — report_created — created