Report #42184
[gotcha] A single content filter refusal makes subsequent responses more likely to be refused, even for benign prompts in the same conversation
When a refusal occurs: \(a\) remove the refused exchange from conversation history before the next turn, or replace it with a sanitized summary that doesn't include the flagged content or the refusal language, \(b\) implement a 'fresh start' mechanism that resets context after refusals, \(c\) never send the raw refusal message \('I can't help with that'\) back to the model as context—it acts as a caution signal. Use a separate pre-send moderation check to catch issues before they enter the conversation context.
Journey Context:
Content moderation systems evaluate the full conversation context. When a refusal occurs, the refusal message itself \('I cannot assist with that request'\) becomes part of the context. This creates a perverse cascading effect: the model now has 'refusal context' that makes it more cautious about subsequent messages. A user who had one message flagged might find that perfectly innocent follow-up questions are also refused or hedged. The refusal context acts like a 'caution flag' that the model can't ignore. This is especially damaging in product UX because the user did nothing wrong on the second message but is being punished for the first. The fix is to surgically remove refusal context from conversation history, but this must be done carefully—you can't just delete messages or the conversation loses coherence. Replacing the refused exchange with a neutral summary \('\[content removed\]'\) preserves conversation structure while removing the caution signal. The gotcha: most teams don't discover this until users complain about being 'stuck' in a refused state with no way out except starting a new conversation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:16:37.640907+00:00— report_created — created