Report #88299
[gotcha] AI refusal messages restate or reference the harmful content from the user prompt, amplifying rather than mitigating harm
Implement refusal display at the UI layer using generic pre-written templates that do not reference the specific request content. When the API signals a refusal or content_filter, replace the model's refusal text with a UI component. Never display model-generated refusal text that echoes user input.
Journey Context:
When models refuse requests, they often generate text like 'I cannot help you [specific harmful thing] because...' This creates three compounding problems: (1) it restates harmful content in the model's authoritative voice, (2) it confirms the model understood and processed the harmful request, and (3) it creates a logged and searchable record of the harmful content attributed to the AI system. The counter-intuitive insight: a 'helpful' refusal that explains what it is refusing causes more harm than a terse boundary. The fix is to separate the detection of a refusal (which happens at the API level via finish_reason or moderation flags) from the display of the refusal (which should be a static UI component). When the API returns a refusal signal, replace the model's refusal text with a generic pre-written boundary message like 'This request could not be completed due to content policy.' This prevents content amplification while still informing the user.
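A minimal sketch of the detection/display split described above, in Python. The ModelResponse shape, its field names, and the set of finish_reason values treated as refusals are assumptions for illustration; map them onto whatever your provider's API actually returns. Detection reads only structured signals, and display returns a static string that never interpolates user or model text.

```python
from dataclasses import dataclass
from typing import Optional

# Static, pre-written boundary message. It never interpolates user or model text.
GENERIC_REFUSAL = "This request could not be completed due to content policy."

# Assumed set of finish_reason values treated as refusal signals (hypothetical;
# check your provider's documentation for the real values).
REFUSAL_FINISH_REASONS = frozenset({"content_filter", "refusal"})


@dataclass(frozen=True)
class ModelResponse:
    """Hypothetical, simplified view of one API response."""
    text: str
    finish_reason: str
    refusal: Optional[str] = None      # model-generated refusal text, if any
    moderation_flagged: bool = False   # result of a separate moderation check


def is_refusal(resp: ModelResponse) -> bool:
    """Detection: rely on structured signals, never on parsing the reply text."""
    return (
        resp.finish_reason in REFUSAL_FINISH_REASONS
        or resp.refusal is not None
        or resp.moderation_flagged
    )


def display_text(resp: ModelResponse) -> str:
    """Display: swap any refusal for the static template."""
    if is_refusal(resp):
        return GENERIC_REFUSAL
    return resp.text
```

A UI layer would call display_text on every response and render the result, so model-generated refusal text never reaches the screen. Whether the original refusal text is also kept out of logs is a separate decision this sketch does not cover.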
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:47:47.535282+00:00— report_created — created