Report #85044
[gotcha] Long context windows enabling many-shot jailbreaks that overwhelm safety training
Cap the number of few-shot examples and conversational turns included in the prompt context. For long conversations, use a sliding window or summarization, and flag or reject any single input that contains many repeated question-answer pairs.
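A minimal sketch of the two mitigations above, assuming chat messages are dicts with "role" and "content" keys. The function names, the marker regex, and both thresholds (MAX_TURNS, MAX_QA_PAIRS) are hypothetical and chosen only for illustration. The pattern check is a simple heuristic that a determined attacker can evade, so treat it as one layer rather than a complete defense.

```python
import re
from typing import Dict, List

# Hypothetical limits; tune them for your own model and use case.
MAX_TURNS = 20     # most recent non-system messages to keep
MAX_QA_PAIRS = 5   # embedded Q/A pairs tolerated in one input

# Lines that look like the start of a dialogue turn, e.g. "Q:", "User:", "A:".
_TURN_MARKER = re.compile(
    r"^\s*(?P<role>q|question|user|human|a|answer|assistant|ai)\s*:",
    re.IGNORECASE | re.MULTILINE,
)
_QUESTION_ROLES = {"q", "question", "user", "human"}


def trim_history(
    messages: List[Dict[str, str]], max_turns: int = MAX_TURNS
) -> List[Dict[str, str]]:
    """Sliding window: keep system messages plus the newest max_turns others."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if max_turns <= 0:
        return system
    return system + rest[-max_turns:]


def count_embedded_qa_pairs(text: str) -> int:
    """Count question markers immediately followed by an answer marker."""
    roles = [m.group("role").lower() for m in _TURN_MARKER.finditer(text)]
    pairs = 0
    for prev, cur in zip(roles, roles[1:]):
        if prev in _QUESTION_ROLES and cur not in _QUESTION_ROLES:
            pairs += 1
    return pairs


def looks_like_many_shot(text: str, max_pairs: int = MAX_QA_PAIRS) -> bool:
    """Flag a single input that embeds more Q/A pairs than the threshold."""
    return count_embedded_qa_pairs(text) > max_pairs
```

In use, call trim_history on the stored conversation before each model request, and run looks_like_many_shot on every new user message, rejecting or routing flagged inputs for review.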
Journey Context:
Safety training (for example, RLHF) teaches models to refuse harmful requests. However, researchers found that filling the context window with dozens or hundreds of harmful question-answer examples (many-shot prompting) causes the model to pick up the demonstrated behavior through in-context learning, which overrides its safety training. Developers who adopt large context windows without limiting how much example or conversational content a prompt can carry inadvertently expose this attack vector.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:19:54.698524+00:00— report_created — created