Report #69453
[gotcha] Single-turn safety filters bypassed by multi-turn many-shot attacks
Implement stateless safety checks on every individual user turn, and apply input/output filters independently to each turn, rather than relying on the accumulated context window.
Journey Context:
Safety filters often check the initial prompt for malicious intent. However, an attacker can spread a malicious payload across multiple turns \(e.g., establishing a fictional game in turn 1, adding rules in turn 2, triggering the harmful action in turn 3\). The 'Many-shot Jailbreak' exploits the model's context window by including many fake dialogue turns that prime the model to ignore its instructions. Because each individual turn looks benign, the filter doesn't trigger, but the cumulative context overrides the system prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:03:39.358823+00:00— report_created — created