Report #81921
[gotcha] AI refusal messages expose system prompt instructions to end users
Never rely on the model to generate refusal text in production. Use the API's structured refusal field (e.g., OpenAI's refusal property in the response) to detect refusals, then display a static generic message that does not reference system instructions, guidelines, or instruction hierarchy. Audit refusal outputs for information leakage.
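A minimal sketch of that pattern, assuming the assistant message has already been converted to a plain mapping with "content" and "refusal" keys (the shape OpenAI's Chat Completions message uses; other providers differ). The names GENERIC_REFUSAL and render_reply are hypothetical, not part of any SDK.

```python
from typing import Any, Mapping

# Static text shown to users. It deliberately says nothing about
# instructions, guidelines, or how the assistant is configured.
GENERIC_REFUSAL = "Sorry, I can't help with that request."


def render_reply(message: Mapping[str, Any]) -> str:
    """Return the user-facing text for one assistant message.

    Assumption: `message` is a dict-like view of the assistant message
    with optional "content" and "refusal" keys.
    """
    if message.get("refusal"):
        # Never forward the model-written refusal text to the user.
        # Log it server-side instead if it is needed for auditing.
        return GENERIC_REFUSAL
    return message.get("content") or ""
```

The structured field may not be populated for every refusal, so a check like this is a first filter rather than a complete defense; refusals written into the normal content still need auditing.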
Journey Context:
When an LLM refuses a request, it often explains why in natural language, and that explanation can quote or paraphrase system prompt instructions, e.g., 'I was instructed not to help with...', 'My system prompt says...', 'As an AI assistant configured to...'. This leaks implementation details that attackers can use to map your system configuration and craft targeted jailbreaks. The OWASP Top 10 for LLM Applications classifies this as Sensitive Information Disclosure (LLM06 in the 2023 v1.1 list; the 2025 list renumbers it to LLM02 and adds a dedicated entry, LLM07: System Prompt Leakage). The tension: contextual, helpful refusal messages improve UX for legitimate users, but the more contextual the refusal, the more it reveals about your system. For security, the correct tradeoff is a generic, static refusal message. This feels like a UX degradation, but it closes a real attack vector. Treat the model's own refusal text as uncontrolled output that may leak system configuration.
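One way to audit logged outputs for this kind of leakage is a simple phrase scan. The sketch below is illustrative only: the pattern list is derived from the example phrases above, it is far from exhaustive, and LEAK_PATTERNS and find_leak_phrases are hypothetical names.

```python
import re
from typing import List

# Phrases that suggest the model is describing its own configuration.
# Assumption: this short list is a starting point, not a complete detector.
LEAK_PATTERNS = [
    re.compile(r"\bI (?:was|am|have been) (?:instructed|told|configured)\b", re.I),
    re.compile(r"\b(?:my|the) system prompt\b", re.I),
    re.compile(r"\bas an AI assistant configured to\b", re.I),
    re.compile(r"\bmy (?:instructions|guidelines)\b", re.I),
]


def find_leak_phrases(text: str) -> List[str]:
    """Return the first match of each pattern found in `text`."""
    hits = []
    for pattern in LEAK_PATTERNS:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits
```

Running this over stored model outputs and reviewing any non-empty result gives a rough signal of how often refusals mention configuration. A paraphrased leak will slip past fixed patterns, so manual review of a sample is still needed.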
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:06:07.782075+00:00— report_created — created