Report #40380
[gotcha] Displaying AI chain-of-thought reasoning leaks system prompt details and increases prompt injection attack surface
Never expose raw model reasoning traces to end users. If showing reasoning is a product requirement, have the model generate a separate, sanitized user-facing explanation that is distinct from the actual CoT. Treat reasoning traces as internal implementation details, like database queries or server logs.
Journey Context:
Chain-of-thought prompting often improves accuracy on complex, multi-step tasks, so developers naturally want to surface this reasoning to users: it builds trust, provides transparency, and helps users verify the AI's logic. But CoT traces frequently contain verbatim fragments of system prompts, references to internal instructions ('remember to always suggest the premium tier'), references to safety guidelines, and even other users' data pulled in through RAG context. Exposing this is both confusing (users see meta-instructions meant for the model, not for them) and a security risk (attackers can reverse-engineer your prompt structure to craft targeted injections). OpenAI's o1 models take a similar approach: the raw reasoning trace is hidden and users see only a model-generated summary. The correct pattern: if your product needs to show 'why,' have the model generate a user-facing explanation as a separate, dedicated output field, essentially asking it to 'explain your reasoning for the user' as a task distinct from the actual reasoning process. This gives you the UX benefit of transparency without the security and confusion costs of exposing raw CoT.
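A minimal sketch of the separate-field pattern, assuming the model is asked to return JSON with `reasoning`, `answer`, and `user_explanation` keys. The field names, the helper names (`ModelOutput`, `parse_model_output`, `leaks_system_prompt`, `to_client_payload`), and the 5-word overlap heuristic are all hypothetical choices for illustration, not part of any particular SDK. The raw trace goes to an internal logger only, and the user-facing explanation is dropped if it repeats a run of words from the system prompt.

```python
import json
import logging
from dataclasses import dataclass

# Internal-only logger; reasoning traces are treated like server logs.
logger = logging.getLogger("model_internal")


@dataclass
class ModelOutput:
    reasoning: str         # raw chain-of-thought, never sent to the client
    answer: str            # the final answer
    user_explanation: str  # separately generated, user-facing explanation


def parse_model_output(raw_json: str) -> ModelOutput:
    """Parse the (assumed) JSON structure returned by the model."""
    data = json.loads(raw_json)
    return ModelOutput(
        reasoning=data.get("reasoning", ""),
        answer=data["answer"],
        user_explanation=data.get("user_explanation", ""),
    )


def leaks_system_prompt(text: str, system_prompt: str, window: int = 5) -> bool:
    """Return True if any `window` consecutive words of the system prompt
    appear in `text`. A crude heuristic, not a complete leak detector."""
    words = system_prompt.lower().split()
    haystack = " ".join(text.lower().split())
    for i in range(len(words) - window + 1):
        if " ".join(words[i:i + window]) in haystack:
            return True
    return False


def to_client_payload(output: ModelOutput, system_prompt: str) -> dict:
    """Build the response sent to the end user, omitting the raw trace."""
    logger.debug("reasoning trace: %s", output.reasoning)
    explanation = output.user_explanation
    if leaks_system_prompt(explanation, system_prompt):
        # Fail closed: drop the explanation rather than risk a leak.
        explanation = "Explanation unavailable."
    return {"answer": output.answer, "explanation": explanation}
```

The payload is built from an explicit allowlist of fields rather than by deleting `reasoning` from the parsed dict, so any extra field the model emits stays internal by default.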
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:14:56.067943+00:00— report_created — created