Report #84504
[gotcha] Detailed refusal messages that explain what was blocked or suggest alternatives inadvertently reveal safety filter boundaries, enabling adversarial prompt refinement
Use generic, fixed refusal copy that does not vary with the safety category that triggered it. Never echo the flagged input or explain which filter fired. Log the detailed reason server-side for ops, but surface only a static message like 'I cannot help with that request', identical regardless of the underlying trigger.
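A minimal Python sketch of this pattern, assuming some safety classifier produces a result object. The names FilterResult, build_refusal, and the logger name safety.audit are hypothetical and chosen for illustration; they do not come from any specific library.

```python
import logging
import uuid
from dataclasses import dataclass

# Hypothetical logger name; route it to server-side storage only.
audit_log = logging.getLogger("safety.audit")

# One fixed string for every refusal, regardless of which filter fired.
GENERIC_REFUSAL = "I cannot help with that request."


@dataclass(frozen=True)
class FilterResult:
    """Hypothetical output of a safety classifier."""
    blocked: bool
    category: str = ""  # e.g. "malware"; never shown to the user
    detail: str = ""    # classifier explanation; never shown to the user


def build_refusal(result: FilterResult, user_input: str) -> str:
    """Log the detailed reason server-side and return the static refusal text."""
    incident_id = uuid.uuid4().hex
    audit_log.warning(
        "refusal incident=%s category=%s detail=%s input_chars=%d",
        incident_id, result.category, result.detail, len(user_input),
    )
    # The return value does not depend on result or user_input.
    return GENERIC_REFUSAL
```

The category and detail go only to the log record. The returned string is a module-level constant, so no code path can interpolate the trigger or the flagged input into it.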
Journey Context:
Good UX instinct says: when you block something, explain why and offer alternatives. With AI safety filters, this instinct is a trap. If your refusal message says 'I cannot generate code for malware, but I can help with...', an adversarial user now knows (a) that a malware filter exists, (b) roughly where its boundary lies, and (c) which phrasing triggers it. They can then refine the prompt to skirt the boundary. Varying the refusal message by filter category makes this worse: each distinct message becomes a probe result that helps map your entire safety surface. The counter-intuitive fix is that the best refusal UX for safety is the worst refusal UX for normal interaction: a generic, uninformative message. Resolve the tension by splitting the channels: give the user a polite generic message, log the detailed reason server-side, and offer alternative actions through a separate, non-reflective channel (e.g., a static list of things the assistant can help with).
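One way to sketch the "non-reflective channel" in Python is shown here, together with a check that the user-visible payload does not change with the category or input. The names CAPABILITIES, RefusalResponse, refusal_response, and is_probe_invariant are hypothetical, and the capability entries are placeholder copy, not taken from any real product.

```python
from __future__ import annotations

from dataclasses import dataclass

GENERIC_REFUSAL = "I cannot help with that request."

# Static list: written once, never derived from the blocked prompt or category.
CAPABILITIES = (
    "Explain programming concepts",
    "Review and debug code",
    "Summarize documents",
)


@dataclass(frozen=True)
class RefusalResponse:
    message: str
    suggestions: tuple[str, ...]


def refusal_response(category: str, user_input: str) -> RefusalResponse:
    """Return the same payload for every category and input.

    Both arguments are deliberately unused here; detailed logging is
    assumed to happen elsewhere, server-side.
    """
    return RefusalResponse(GENERIC_REFUSAL, CAPABILITIES)


def is_probe_invariant(categories: list[str], inputs: list[str]) -> bool:
    """Check that no (category, input) pair changes the user-visible payload."""
    payloads = {refusal_response(c, i) for c in categories for i in inputs}
    return len(payloads) == 1
```

A check like is_probe_invariant could run in a test suite to catch a later change that quietly makes the refusal copy or the suggestions depend on the trigger.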
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:25:47.691672+00:00— report_created — created