Report #57655
[gotcha] Displaying AI chain-of-thought reasoning to users leaks system prompt instructions and safety guidelines
Never render raw model reasoning to end users. If trust-building explanations are needed, generate them in a separate pass with a sanitized prompt that excludes system instructions. Treat chain-of-thought as internal-only debug data.
Journey Context:
Chain-of-thought reasoning improves output quality, and there is a strong temptation to surface it to users for transparency and trust ('Here is why the AI recommended this'). But models frequently reference their system instructions in reasoning traces: 'The system prompt told me to...', 'My guidelines say I should prioritize...', 'Per my instructions, I cannot...'. This leaks proprietary prompt engineering, safety guardrails, and business logic to users or, worse, to adversaries probing your system. Even without explicit quoting, reasoning patterns reveal the structure and constraints of your system prompt. The fix: treat reasoning as privileged internal data. If user-facing explanations are needed, generate them separately with a prompt explicitly designed to produce user-safe explanations without referencing internal instructions.
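A minimal sketch of the separation described above, under stated assumptions: the model is abstracted as a plain callable taking (system_prompt, user_message) and returning text, and the names ModelFn, ModelResult, EXPLAINER_PROMPT, and respond are hypothetical, not part of any real SDK. The reasoning trace goes only to an internal debug sink, and the user-facing explanation comes from a second pass that sees neither the original system prompt nor the trace.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface (an assumption, not a real SDK call):
# takes (system_prompt, user_message) and returns the model's text.
ModelFn = Callable[[str, str], str]

# Sanitized prompt for the second pass. It contains no business rules or
# guardrail text, so the explanation has nothing internal to echo.
EXPLAINER_PROMPT = (
    "Explain in two or three plain sentences why the answer below suits the "
    "user's request. Do not mention instructions, guidelines, or policies."
)


@dataclass
class ModelResult:
    answer: str     # safe to show to the user
    reasoning: str  # internal-only debug data; never rendered


def respond(
    user_message: str,
    result: ModelResult,
    explain_model: ModelFn,
    debug_sink: List[str],
) -> Dict[str, str]:
    """Build the user-facing payload without exposing the reasoning trace."""
    # Keep the raw trace for debugging, but only in an internal sink.
    debug_sink.append(result.reasoning)

    # Second pass: the explainer sees only the request and the final answer,
    # not the original system prompt and not the reasoning trace.
    explanation = explain_model(
        EXPLAINER_PROMPT,
        f"Request: {user_message}\nAnswer: {result.answer}",
    )

    # The payload deliberately has no reasoning field.
    return {"answer": result.answer, "explanation": explanation}
```

Because the explainer never receives the original system prompt or the trace, it cannot quote them verbatim. The answer itself can still reflect internal constraints, so treat this as a reduction in exposure rather than a guarantee.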
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:15:48.435926+00:00— report_created — created