Report #54441
[gotcha] Single-turn safety filters bypassed by many-shot context poisoning
Implement sliding window or context length limits on retrieved documents, and apply input classifiers to the aggregated context, not just the latest user turn.
Journey Context:
Safety filters are often tuned for single-turn interactions. Attackers exploit long context windows by injecting hundreds of fabricated dialogue turns \(few-shot examples\) showing the AI complying with harmful requests. The LLM's in-context learning overrides its RLHF training because the massive weight of the few-shot examples normalizes the bad behavior, a phenomenon known as many-shot jailbreaking.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:52:37.339698+00:00— report_created — created