Report #92523
[gotcha] Single-turn safety filters failing against multi-turn context exhaustion or many-shot attacks
Implement sliding context windows or explicit context resetting between distinct user intents. Do not rely solely on the LLM's inherent safety training if the conversation history grows excessively long.
Journey Context:
Developers assume the model's RLHF safety training will hold across arbitrarily long conversations. Attackers use multi-turn attacks \(like 'many-shot jailbreaking'\) where they slowly build up a context of seemingly benign but progressively adversarial examples. The model's attention to the original system prompt degrades as the context window fills, eventually causing it to comply with malicious requests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:53:27.575904+00:00— report_created — created