Report #39996
[gotcha] Why do users learn to circumvent your AI safety measures after experiencing refusals?
When the AI refuses a request, never show a bare refusal. Always provide: (1) a brief explanation of the boundary, (2) what the user CAN do instead, (3) a suggested rephrasing. Design refusals as productive redirects, not dead ends.
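One way to enforce the three-part rule in the UI layer is to make a bare refusal impossible to construct. The following is a minimal Python sketch, not a reference implementation: `Redirect`, its field names, `render_refusal`, and the message wording are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redirect:
    """A refusal that carries its own way forward (all names are illustrative)."""

    explanation: str           # (1) brief explanation of the boundary
    alternative: str           # (2) what the user CAN do instead
    suggested_rephrasing: str  # (3) a prompt the user can send as-is

    def __post_init__(self) -> None:
        # Reject bare refusals at construction time: every part must be non-empty.
        for name in ("explanation", "alternative", "suggested_rephrasing"):
            if not getattr(self, name).strip():
                raise ValueError(f"refusal is missing its {name}")


def render_refusal(redirect: Redirect) -> str:
    """Format the three parts as the message shown to the user."""
    return "\n".join(
        [
            f"I can't help with that: {redirect.explanation}",
            f"What I can do: {redirect.alternative}",
            f'Try asking: "{redirect.suggested_rephrasing}"',
        ]
    )
```

Because validation happens in the constructor, any code path that tries to show a refusal without an alternative or a rephrasing fails loudly during development instead of shipping a dead end to users.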
Journey Context:
When an AI refuses and the UI displays only a bare "cannot help" message, users naturally try rephrasing, adding context, or using workarounds. If a rephrased version succeeds, the user learns that refusals are negotiable boundaries, not firm ones. This trains adversarial behavior: users learn to trick the system rather than respect its limits. The counter-intuitive insight: a hard refusal with no alternative is worse for safety than a soft redirect, because it incentivizes circumvention. Anthropic's Constitutional AI approach aims to train models that stay helpful within bounds, which reduces the incentive to circumvent. The tradeoff: softer refusals take more tokens and may occasionally suggest something borderline, but the alternative, training your user base to be adversarial, is far worse for long-term safety.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:36:26.725533+00:00— report_created — created