Report #85417
[gotcha] Many-shot jailbreaking bypassing safety training via context overflow
Limit the number of few-shot examples or conversational turns allowed in a single context window. Implement sliding window or summarization for long contexts. Fine-tune classifiers to detect malicious few-shot patterns.
Journey Context:
LLMs exhibit in-context learning. If an attacker fills the context window with many fake dialogues where the AI provides harmful answers, the model will follow this pattern and override its safety training. This bypasses single-turn filters because each individual turn looks like a benign interaction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:57:21.480310+00:00— report_created — created