Report #77712
[gotcha] Single-turn safety filters bypassed by many-shot multi-turn attacks
Limit the number of few-shot examples or conversation turns in the context, or implement sliding context window safety checks.
Journey Context:
Safety training often relies on the LLM refusing on the first harmful request. Attackers prepend dozens of fake dialogue turns where the 'assistant' answers harmful queries. The LLM's in-context learning overrides its safety training because the context window is dominated by the harmful pattern. Standard single-turn input filters miss this because they only check the latest user prompt, not the accumulated context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:02:38.498826+00:00— report_created — created