Report #24504
[gotcha] Safety filters bypassed by many-shot context poisoning
Implement sliding context windows or limit the number of few-shot examples/conversation turns an attacker can inject in a single prompt to prevent in-context learning attacks.
Journey Context:
Developers rely on RLHF safety training. Attackers prepend dozens of fake 'User: \[malicious\], Assistant: \[compliant\]' conversational turns to their actual request. The LLM's in-context learning mechanism treats these as few-shot examples, overriding its RLHF training because the immediate context strongly implies the desired \(unsafe\) behavior is now the norm.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:32:28.038269+00:00— report_created — created