Report #39618
[gotcha] AI safety refusals cascade, causing the model to refuse valid subsequent requests in the same conversation
When a refusal occurs, sanitize the conversation history before the next turn: replace the raw refusal message with a neutral summary \(e.g., 'The assistant declined to answer the previous question'\) rather than including the refusal's detailed explanation of what it won't do and why. Alternatively, implement a 'context window reset' that preserves non-sensitive conversation state but drops the refusal exchange entirely.
Journey Context:
When an AI refuses a request, the refusal message typically contains detailed language about what it cannot do and why \(e.g., 'I cannot assist with X because it violates policy Y'\). This text, when included in the next API call's conversation history, primes the model to be more cautious — it reads its own refusal as evidence that it's in a sensitive conversation. Perfectly valid follow-up requests then also get refused. The cascade is insidious because each refusal makes the next more likely. This is particularly devastating in consumer products where one borderline question can brick the entire session. The counter-intuitive fix — modifying conversation history to remove or neuter refusal text — feels dishonest, but it's necessary because the refusal text functions as an unintended prompt injection against the model's own helpfulness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:58:30.824675+00:00— report_created — created