Report #29527
[gotcha] AI refusal messages expose system prompt instructions and content policy details to users
Implement a refusal translation layer: catch refusals before rendering and replace them with generic, user-friendly messages. Use structured outputs or function calling for refusals so you control the UI, not the model. Never let raw model refusal text reach the user without sanitization.
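A minimal sketch of the structured-output approach, assuming the model has been instructed to reply with a JSON envelope of the form {"status": "ok" | "refused", "content": "..."}. The envelope shape, the `render_reply` function, and the `GENERIC_REFUSAL` wording are hypothetical illustrations, not part of any particular vendor API.

```python
import json

# Hypothetical curated message written in the product's own voice.
GENERIC_REFUSAL = "Sorry, I can't help with that request."


def render_reply(raw_model_output: str) -> str:
    """Return text that is safe to show the user.

    Assumes the model was asked to emit a JSON envelope:
        {"status": "ok" | "refused", "content": "<answer text>"}
    Anything that does not match the envelope fails closed, so
    unparsed model text never reaches the UI.
    """
    try:
        envelope = json.loads(raw_model_output)
    except json.JSONDecodeError:
        # The model ignored the schema, possibly with a free-text refusal.
        return GENERIC_REFUSAL

    if not isinstance(envelope, dict):
        return GENERIC_REFUSAL

    content = envelope.get("content")
    if envelope.get("status") != "ok" or not isinstance(content, str):
        # Covers explicit refusals and malformed envelopes alike; any
        # model-written refusal text inside the envelope is discarded.
        return GENERIC_REFUSAL

    return content
```

The sketch fails closed: output that does not parse is treated like a refusal. An "ok" status does not guarantee the content is free of refusal language, so this is one layer rather than a complete defense.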
Journey Context:
When an LLM refuses a request, its refusal message often references the system prompt: 'I can't do that because I was instructed to...' or 'My guidelines prevent me from...' This leaks your product's internal configuration to users and attackers. It is both a UX failure (confusing, overly detailed, inconsistent with your product voice) and a security failure (information disclosure that aids prompt injection). Teams assume the model will refuse gracefully, but refusal text is unpredictable and model-dependent. The fix is to never render raw refusal text. Instead, detect refusals via structured output schemas, classification, or pattern matching, and show your own curated message. The hardest part is catching all refusal variants: models are creative in how they refuse, so combine multiple detection strategies.
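Combining detection strategies on free-text output could be sketched as follows. The regular expressions, the `sanitize` function, and the `SAFE_MESSAGE` wording are assumptions made for illustration. Real refusal phrasing varies by model, so the pattern list is a starting point and not an exhaustive set.

```python
import re
from typing import Callable, Iterable

# Hypothetical curated message; replace with your product's wording.
SAFE_MESSAGE = "Sorry, I can't help with that. Try rephrasing your request."

# Illustrative patterns only; they will miss variants and can over-match.
REFUSAL_PATTERNS = [
    re.compile(r"\bI (?:can['’]?t|cannot|won['’]?t|am unable to)\b", re.IGNORECASE),
    re.compile(r"\bmy (?:guidelines|instructions|system prompt)\b", re.IGNORECASE),
    re.compile(r"\bI was instructed\b", re.IGNORECASE),
]

# A detector takes the model's text and reports whether it looks like a refusal.
Detector = Callable[[str], bool]


def matches_refusal_pattern(text: str) -> bool:
    """Pattern-matching detector: True if any known refusal phrase appears."""
    return any(pattern.search(text) for pattern in REFUSAL_PATTERNS)


def sanitize(
    text: str,
    detectors: Iterable[Detector] = (matches_refusal_pattern,),
) -> str:
    """Replace the text with a curated message if any detector flags it."""
    if any(detect(text) for detect in detectors):
        return SAFE_MESSAGE
    return text
```

Additional detectors, such as a classifier call wrapped in a function with the same signature, can be passed through `detectors`. Because a single positive result triggers replacement, each added detector widens coverage at the cost of more false positives.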
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:57:01.996792+00:00— report_created — created