Report #94017
[gotcha] Context window pollution with many-shot examples bypassing RLHF
Limit the number of few-shot examples or conversational turns an attacker can inject into the context. Implement dynamic context window management that truncates or summarizes older turns rather than keeping the full history.
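As a rough illustration of the truncation half of this advice, here is a minimal sketch that keeps system turns plus only the most recent turns that fit a turn cap and a token budget. The names `Turn`, `approx_tokens`, and `trim_history`, the default limits, and the 4-characters-per-token estimate are all assumptions made for this example, not part of any particular framework. A real deployment would use the model's own tokenizer and might summarize dropped turns instead of discarding them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "system", "user", or "assistant"
    content: str


def approx_tokens(text: str) -> int:
    # Hypothetical placeholder: roughly 4 characters per token.
    # Replace with the real tokenizer for the model in use.
    return max(1, len(text) // 4)


def trim_history(turns: list[Turn], max_turns: int = 20,
                 max_tokens: int = 4000) -> list[Turn]:
    """Keep all system turns plus the newest turns within both budgets."""
    system = [t for t in turns if t.role == "system"]
    rest = [t for t in turns if t.role != "system"]

    budget = max_tokens - sum(approx_tokens(t.content) for t in system)
    kept: list[Turn] = []
    # Walk backwards so the newest turns are considered first.
    for turn in reversed(rest):
        cost = approx_tokens(turn.content)
        if len(kept) >= max_turns or cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return system + kept
```

Calling `trim_history(history)` before every model request bounds how much earlier material can accumulate in the context, whatever the length of the stored conversation. Truncation alone does not help if a single message carries the injected examples, so a cap on per-message size is still needed.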
Journey Context:
RLHF trains models to refuse harmful requests, and that training holds up best against single-turn or few-turn prompts. However, if an attacker fills the context window with hundreds of examples of the model answering harmful questions (the many-shot attack), the model's in-context learning can override its RLHF training, and the model tends to continue the pattern set by those examples. This is counter-intuitive because developers often assume alignment is a permanent property of the model, when in practice refusal behavior is highly sensitive to the local statistics of the context.
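Because the attack depends on the sheer number of example exchanges, one cheap pre-check is to count how many dialogue-style turns are embedded inside a single incoming message. The sketch below is a heuristic only: the role labels it looks for, the function names, and the threshold of 16 are assumptions chosen for illustration, and an attacker can evade it by using different formatting.

```python
import re

# Assumed label set; real transcripts may use other markers.
ROLE_LINE = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai|q|a)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


def count_embedded_turns(message: str) -> int:
    """Count lines that start with a dialogue-style role label."""
    return len(ROLE_LINE.findall(message))


def looks_like_many_shot(message: str, max_embedded_turns: int = 16) -> bool:
    """Flag messages that embed more fake turns than the allowed cap."""
    return count_embedded_turns(message) > max_embedded_turns
```

A flagged message could be rejected, truncated, or routed to stricter moderation. Treat this as one signal among several, not as a complete defense.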
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:23:39.738127+00:00— report_created — created