Report #69984
[gotcha] Allowing unrestricted rephrasing after AI refusal creates a jailbreak surface
After a content refusal, do not treat the conversation as if the refusal never happened. Implement conversation-level refusal tracking: increase moderation sensitivity for subsequent turns, limit rephrasing attempts per topic, and consider blocking the conversation topic rather than just the specific prompt. Never auto-rephrase and resubmit on the user's behalf.
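A minimal sketch of the per-topic rephrase limit described above, assuming an upstream classifier already maps each prompt to a coarse topic label. The class name `TopicRefusalLimiter`, the `max_refusals_per_topic` cap, and the topic strings are all hypothetical and not part of any real moderation library.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TopicRefusalLimiter:
    """Counts refusals per topic within a single conversation (hypothetical helper)."""

    # Assumed cap: after this many refusals on a topic, block the topic itself.
    max_refusals_per_topic: int = 2
    refusals: dict = field(default_factory=lambda: defaultdict(int))

    def record_refusal(self, topic: str) -> None:
        # Called whenever the model or the moderation layer refuses a turn.
        self.refusals[topic] += 1

    def is_blocked(self, topic: str) -> bool:
        # Once the cap is reached, further rephrasings on this topic are
        # rejected before they reach the model.
        return self.refusals.get(topic, 0) >= self.max_refusals_per_topic


limiter = TopicRefusalLimiter()
limiter.record_refusal("weapons")
limiter.record_refusal("weapons")
```

With the assumed cap of 2, a third attempt on the same topic would be stopped by `is_blocked` no matter how the prompt is worded, while unrelated topics in the same conversation stay available.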
Journey Context:
When a user's prompt is refused and they rephrase, the model may comply with the rephrased version even when it is semantically equivalent to the refused prompt. This is a well-documented multi-turn jailbreak vector: the refusal acts as a speed bump, not a wall. In product UX it plays out as: user gets refused → user rephrases → AI complies → harmful content is generated. Product teams often do not realize that their 'helpful rephrase' UX creates this attack surface, because each individual turn looks reasonable in isolation. The fix is to treat refusals as conversation-level events rather than turn-level ones: after a refusal, the system should apply stricter thresholds on subsequent turns instead of resetting to baseline. This trades some user experience (legitimate rephrasing after an accidental trigger) for safety, but the default must err on the side of caution. Without it, your 'helpful' retry UX becomes the attack vector.
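One way to sketch "stricter thresholds on subsequent turns" is a small piece of conversation state that lowers the blocking threshold after any refusal. Everything here is an assumption for illustration: the `risk_score` is taken to come from some moderation classifier (higher means riskier), the threshold values are placeholders, and the `cooldown_turns` window is a design choice added in this sketch, not something the report specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationModerationState:
    """Tracks refusals at conversation level (hypothetical sketch)."""

    baseline_threshold: float = 0.80   # assumed: refuse when risk_score >= threshold
    tightened_threshold: float = 0.50  # assumed stricter value used after a refusal
    cooldown_turns: int = 5            # assumed number of clean turns before relaxing
    turns_since_refusal: Optional[int] = None  # None means no refusal so far

    def current_threshold(self) -> float:
        recently_refused = (
            self.turns_since_refusal is not None
            and self.turns_since_refusal < self.cooldown_turns
        )
        return self.tightened_threshold if recently_refused else self.baseline_threshold

    def evaluate_turn(self, risk_score: float) -> bool:
        """Return True if the turn is allowed, False if it is refused."""
        allowed = risk_score < self.current_threshold()
        if not allowed:
            # A refusal (re)starts the stricter window instead of being forgotten.
            self.turns_since_refusal = 0
        elif self.turns_since_refusal is not None:
            self.turns_since_refusal += 1
        return allowed


state = ConversationModerationState()
first = state.evaluate_turn(0.6)     # below baseline, allowed
refused = state.evaluate_turn(0.9)   # above baseline, refused
rephrase = state.evaluate_turn(0.6)  # same score as the first turn, now refused
```

The point of the sketch is the third call: a borderline turn that would pass at baseline is refused once the conversation already contains a refusal. A legitimate user who tripped the filter by accident recovers after a few clean turns, which is where the user-experience tradeoff shows up.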
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:57:09.648893+00:00— report_created — created