Report #70259
[gotcha] Many-shot jailbreaking bypassing single-turn safety filters
Cap the number of few-shot examples that untrusted sources can place in the context window, or monitor the context with a sliding window to detect and interrupt runs of simulated policy-violating Q&A turns.
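A minimal sketch of the first option, assuming untrusted text embeds its fake dialogue as lines starting with role labels such as "User:" or "Assistant:". The marker list, the threshold of 8, and the function names are illustrative assumptions and not part of this report. Real attacks may use other formats that this check would miss.

```python
import re

# Assumed role labels; attackers may use different labels or formatting.
ROLE_MARKER = re.compile(
    r"^[ \t]*(user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)

# Assumed budget for simulated turns inside a single untrusted input.
MAX_EMBEDDED_TURNS = 8


def count_embedded_turns(untrusted_text: str) -> int:
    """Count lines that look like the start of a simulated dialogue turn."""
    return len(ROLE_MARKER.findall(untrusted_text))


def exceeds_turn_budget(untrusted_text: str, limit: int = MAX_EMBEDDED_TURNS) -> bool:
    """Return True when the input carries more simulated turns than allowed."""
    return count_embedded_turns(untrusted_text) > limit
```

An input that exceeds the budget could be rejected, truncated, or routed to review before it reaches the model. Which of these is appropriate depends on the application.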
Journey Context:
Safety training is largely based on single-turn refusals. If an attacker prepends a large number of fabricated dialogue turns in which the assistant complies with harmful requests, the pattern of compliance dominates the context. In-context learning from these examples can then outweigh the model's safety training, so the model answers the final harmful query instead of refusing it.
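The pattern described above, a long run of turns in which the assistant appears to comply, can be watched for with a sliding window over question/answer pairs. The sketch below is an assumption-laden illustration: `first_flagged_window`, the window size of 10, the trip threshold of 3, and `toy_classifier` are all hypothetical. In practice the `is_violating` callable would be a real policy classifier, which this report does not specify.

```python
from collections import deque
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str]  # (user question, assistant answer)


def first_flagged_window(
    pairs: Iterable[Pair],
    is_violating: Callable[[str, str], bool],
    window: int = 10,
    max_flagged: int = 3,
) -> int:
    """Return the index of the pair at which the monitor trips, or -1.

    The monitor trips when at least `max_flagged` of the most recent
    `window` pairs are judged policy-violating.
    """
    recent = deque(maxlen=window)  # old verdicts fall out automatically
    for i, (question, answer) in enumerate(pairs):
        recent.append(is_violating(question, answer))
        if sum(recent) >= max_flagged:
            return i
    return -1


def toy_classifier(question: str, answer: str) -> bool:
    """Placeholder verdict: a marked question that was not refused."""
    refused = answer.lower().startswith("i can't")
    return "[disallowed]" in question and not refused
```

For example, five consecutive flagged pairs trip the monitor at index 2, while two flagged pairs followed by ten benign ones and two more flagged pairs do not, because the early verdicts have left the window by then.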
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:31:03.395291+00:00— report_created — created