Report #64452
[gotcha] Many-shot jailbreaking bypasses safety training by stuffing the context with harmful examples
Cap the number of few-shot examples or conversation turns accepted in a single context. Apply a sliding window to untrusted content so that only the most recent portion reaches the model. Monitor inputs for long runs of repeated Q&A formatting. Run output classifiers regardless of input length: safety filters must behave the same at any context length.
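A minimal sketch of the turn cap and the repeated-Q&A check, in Python. Everything named here is hypothetical and not taken from any library: `TURN_MARKER`, `screen_untrusted`, and the default `max_turns=16` are illustrative assumptions, and the marker list would need tuning to the formats actually seen in your traffic. This is an input-side heuristic only. It does not replace the output classifier, which should still run on every response.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: these role labels are illustrative, not an exhaustive list.
# Matches a dialogue-style label at the start of a line, e.g. "Q:" or "Assistant:".
TURN_MARKER = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai|q|a|question|answer)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class ScreenResult:
    turn_count: int  # dialogue-style markers found in the original input
    flagged: bool    # True when the count exceeded the cap
    text: str        # input after the sliding window was applied


def count_turn_markers(text: str) -> int:
    """Count line-initial dialogue markers in untrusted text."""
    return len(TURN_MARKER.findall(text))


def keep_last_turns(text: str, max_turns: int) -> str:
    """Sliding window: keep only the last `max_turns` marked turns.

    Anything before the first kept marker, including any unmarked
    preamble, is dropped when the cap is exceeded.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    starts = [m.start() for m in TURN_MARKER.finditer(text)]
    if len(starts) <= max_turns:
        return text
    return text[starts[-max_turns]:]


def screen_untrusted(text: str, max_turns: int = 16) -> ScreenResult:
    """Flag inputs with too many embedded turns and truncate them."""
    count = count_turn_markers(text)
    flagged = count > max_turns
    windowed = keep_last_turns(text, max_turns) if flagged else text
    return ScreenResult(turn_count=count, flagged=flagged, text=windowed)


if __name__ == "__main__":
    # Harmless placeholder content standing in for a stuffed context.
    stuffed = "".join(f"Q: question {i}\nA: answer {i}\n" for i in range(50))
    result = screen_untrusted(stuffed)
    print(result.turn_count, result.flagged, count_turn_markers(result.text))
```

Flagged inputs can be logged for review in addition to being truncated. The cap is applied to the marker count rather than to raw token length, so that long but ordinary documents are not penalized; a token-based limit would be a reasonable second layer.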
Journey Context:
Safety-trained LLMs resist harmful requests in short contexts. When the context window is filled with many fabricated examples of the model answering harmful questions, however, the model's behavior shifts to follow the pattern established in context. The attack exploits in-context learning, the same mechanism that makes few-shot prompting work. The counter-intuitive finding is that more capable models with longer context windows are MORE vulnerable, not less: they can hold more fake examples and are better at picking up the pattern. A model with a 200K-token context can be fed hundreds of harmful Q&A pairs, and the pressure to comply grows with the number of examples. Upgrading to a longer-context model can therefore make a deployment less safe.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:40:03.129144+00:00— report_created — created