Report #42469
[gotcha] AI refusal messages expose system prompt constraints and enable prompt extraction
Never surface raw model refusal text directly to the user. Intercept refusals at the application layer and replace them with generic, user-friendly messages that do not reference your system prompt rules. Log the raw refusal internally for monitoring. If you must show refusal reasons, use fixed template strings ('I can't help with that') rather than the model's own explanation of why it refused.
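A minimal sketch of the interception step, assuming the application receives the model's reply as a plain string before it reaches the user. The names `sanitize_reply`, `looks_like_refusal`, `REFUSAL_PATTERNS`, and `GENERIC_REFUSAL` are hypothetical, and the regex heuristics are illustrative only: they will produce false positives (for example, a normal answer containing "I can't") and should be tuned or replaced with a classifier for a real model.

```python
import logging
import re

# Internal-only logger; route it to your monitoring stack, never to the user.
logger = logging.getLogger("refusal_monitor")

# Fixed template shown to users. It never varies with the model's wording.
GENERIC_REFUSAL = "I can't help with that."

# Hypothetical heuristic patterns (an assumption, not a standard list).
REFUSAL_PATTERNS = [
    re.compile(r"\bI (?:cannot|can['’]t|am unable to|won['’]t)\b", re.IGNORECASE),
    re.compile(r"\bmy (?:instructions|guidelines|system prompt|rules)\b", re.IGNORECASE),
    re.compile(r"\bI(?:['’]m| am) not (?:able|allowed|permitted) to\b", re.IGNORECASE),
]


def looks_like_refusal(text: str) -> bool:
    """Return True if any heuristic refusal pattern matches the reply."""
    return any(pattern.search(text) for pattern in REFUSAL_PATTERNS)


def sanitize_reply(raw_reply: str) -> str:
    """Swap a detected refusal for the fixed template and log the original."""
    if looks_like_refusal(raw_reply):
        # The raw text stays in internal logs for monitoring.
        logger.warning("model refusal intercepted: %r", raw_reply)
        return GENERIC_REFUSAL
    return raw_reply


if __name__ == "__main__":
    raw = ("I cannot fulfill this request because my instructions "
           "specify that I should not discuss pricing.")
    print(sanitize_reply(raw))
    print(sanitize_reply("Paris is the capital of France."))
```

Returning one constant string for every detected refusal means the user-facing text carries no information about which rule fired; the detail survives only in the internal log.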
Journey Context:
When an LLM refuses a request, it often generates text like 'I cannot fulfill this request because my instructions specify that I should not…', which directly reveals system prompt content. Even without explicit quoting, the pattern of refusals across multiple attempts lets users reconstruct system prompt rules through differential analysis (try X, get refused, try Y, get refused, infer the rule). This is counter-intuitive because transparency feels like good UX: you want to tell users why their request failed. But with LLMs the 'why' is literally your proprietary prompt. The OWASP Top 10 for LLM Applications covers this kind of leak under Sensitive Information Disclosure, and its 2025 edition adds a separate System Prompt Leakage entry. The fix feels like worse UX (generic refusals are frustrating), but it closes the most direct channel for systematic prompt extraction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:45:25.104955+00:00— report_created — created