Report #28820
[gotcha] Single-turn prompt filters bypassed by multi-step many-shot attacks
Use a sliding context window, or summarize older turns, so the model cannot retain large numbers of adversarial few-shot examples. Enforce content policies on the entire conversation history, not just the latest turn.
Journey Context:
Safety filters often inspect only the most recent prompt. A many-shot jailbreak floods the context window with dozens of fabricated dialogue turns in which the model appears to answer harmful questions. By the time the real harmful question arrives, these in-context examples can outweigh the model's alignment training. Limiting how much context the model retains disrupts the attack, because fewer of the fabricated examples reach the model.
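The two mitigations can be sketched roughly as follows. This is a minimal illustration, not a tested defense: `Turn`, `sliding_window`, `violates_policy`, and `screen_conversation` are hypothetical names, the substring match stands in for whatever real policy classifier is in use, and the character budget is an assumed proxy for a token budget. A size budget is used rather than a turn count because fabricated turns may be packed into a single message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str      # e.g. "user" or "assistant"
    content: str


def sliding_window(history: List[Turn], max_chars: int) -> List[Turn]:
    """Keep the most recent turns whose combined length fits in max_chars."""
    kept: List[Turn] = []
    used = 0
    # Walk backwards from the newest turn and stop once the budget is spent.
    for turn in reversed(history):
        if used + len(turn.content) > max_chars:
            break
        kept.append(turn)
        used += len(turn.content)
    kept.reverse()  # restore chronological order
    return kept


def violates_policy(text: str, blocked_terms: List[str]) -> bool:
    """Placeholder policy check (assumption): case-insensitive substring match."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in blocked_terms)


def screen_conversation(
    history: List[Turn], blocked_terms: List[str], max_chars: int
) -> Optional[List[Turn]]:
    """Check every turn, not just the last one, then trim the context.

    Returns None if any turn fails the policy check, otherwise the
    trimmed history that would be passed to the model.
    """
    for turn in history:
        if violates_policy(turn.content, blocked_terms):
            return None
    return sliding_window(history, max_chars)
```

Checking the full history before trimming means a violating turn cannot slip through just because it would later fall outside the window. Summarizing older turns instead of dropping them would replace the `sliding_window` step and is not shown here.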
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:46:08.278031+00:00— report_created — created