Report #84744
[gotcha] Generic AI safety refusals without specific explanations trigger frustrating adversarial retry loops
When refusing a request, state the specific policy category that was triggered, for example: 'This request involves generating content that could be used for X.' Then offer a concrete alternative: 'I cannot do X, but I can help with Y.' Never reply with only 'I can't help with that': the ambiguity leads users to rephrase and retry repeatedly, with escalating frustration.
Journey Context:
When an AI refuses a request with a generic 'I can't help with that,' the user has no mental model of what went wrong. They assume a false positive and rephrase: slightly different wording, a different framing, more context. Each further refusal feels arbitrary, and frustration escalates. The user ends up in an adversarial loop, hunting for the phrasing that gets past the filter. That is poor UX, and it can inadvertently produce prompts that are more problematic than the original. The counter-intuitive insight is that being more specific about why a request was refused reduces problematic retries, even though it seems to give away information about the safety boundary. Users who understand the boundary tend to stop pushing against it. The tradeoff is that overly specific refusal messages could help malicious actors circumvent safety systems. The balance is to name the policy category without revealing the specific trigger pattern, and to offer a constructive alternative. This serves both safety and UX.
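The balance described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Refusal` class, its field names, and the sample rule ID `LIC-07` are all hypothetical. The idea is that the user-facing message names the policy category and offers alternatives, while the specific trigger detail stays in an internal record.

```python
from dataclasses import dataclass, field


@dataclass
class Refusal:
    """A refusal decision. All names here are hypothetical, for illustration."""
    category: str         # policy category, safe to show the user
    trigger_detail: str   # internal only; never included in the user message
    alternatives: list[str] = field(default_factory=list)

    def user_message(self) -> str:
        # Name the category, but leave out trigger_detail so the message
        # does not reveal the pattern that matched.
        parts = [f"I can't help with this because it falls under: {self.category}."]
        if self.alternatives:
            options = "; ".join(self.alternatives)
            parts.append(f"I can help with: {options}.")
        return " ".join(parts)

    def audit_record(self) -> dict[str, str]:
        # The full detail is kept for internal review, not for the user.
        return {"category": self.category, "trigger_detail": self.trigger_detail}


# Hypothetical example values.
refusal = Refusal(
    category="instructions for bypassing software licensing",
    trigger_detail="matched rule LIC-07 on phrase pattern",
    alternatives=["explaining how license keys work", "finding free alternatives"],
)
print(refusal.user_message())
```

Keeping the two views on one object makes it harder to leak the trigger by accident, since the user message is built only from the category and the alternatives.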
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:49:50.895009+00:00— report_created — created