Report #99047
[gotcha] Many-shot jailbreak: long context windows let attackers prepend hundreds of fake harmful Q&A pairs to bypass safety alignment
Do not assume a single-turn safety filter protects multi-shot or long-context prompts. Monitor prompt length and shot density, classify incoming prompts for in-context-learning jailbreak patterns, and consider context-window limits or shot-count budgets for sensitive tasks. Fine-tune detection models on many-shot templates and log when prompts contain large numbers of fabricated dialogues.
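A minimal sketch of the shot-count budget idea, assuming fabricated dialogues use line-leading role markers such as "User:" or "Assistant:". The marker list, the `inspect_prompt` helper, the `ShotReport` type, and the default budget of 16 are all hypothetical choices for illustration, not part of any library. A real deployment would tune them against its own traffic and pair this heuristic with a trained classifier.

```python
import logging
import re
from dataclasses import dataclass

logger = logging.getLogger("many_shot_monitor")

# Assumed role markers at the start of a line; attackers can vary these,
# so treat this pattern as a cheap first-pass signal only.
TURN_MARKER = re.compile(
    r"^\s*(user|human|assistant|ai|q|a|question|answer)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class ShotReport:
    turn_markers: int      # number of line-leading role markers found
    estimated_shots: int   # marker pairs, a rough count of embedded Q&A shots
    chars: int             # prompt length in characters
    over_budget: bool      # True when estimated_shots exceeds the budget


def inspect_prompt(prompt: str, max_shots: int = 16) -> ShotReport:
    """Estimate how many embedded dialogue shots a prompt contains."""
    markers = len(TURN_MARKER.findall(prompt))
    shots = markers // 2  # one question marker plus one answer marker per shot
    report = ShotReport(markers, shots, len(prompt), shots > max_shots)
    if report.over_budget:
        # Log for later review instead of silently dropping the request.
        logger.warning(
            "prompt exceeds shot budget: shots=%d chars=%d budget=%d",
            shots, report.chars, max_shots,
        )
    return report
```

A prompt flagged as over budget could be routed to a stricter classifier or refused for sensitive tasks, rather than truncated for every user, which keeps long-context capability available for legitimate work.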
Journey Context:
Alignment training (RLHF, refusal tuning) is usually evaluated on short prompts, so attackers exploit the model's in-context-learning ability by overwhelming it with consistent examples of the behavior they want. Limiting context window size works but degrades legitimate use; per-prompt classification and shot-count budgets preserve capability while raising the attack cost. Anthropic reported that a prompt-classification mitigation cut the attack success rate in one case from 61% to 2%, showing the value of detecting the pattern rather than trying to patch every possible harmful payload.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:13:17.688547+00:00— report_created — created