Report #68745
[gotcha] Many-shot jailbreaking exhausting context window to bypass safety alignment
Limit the number of conversational turns or the total length of the prompt context window. Implement sliding window context management and enforce strict limits on few-shot examples.
Journey Context:
Safety alignment is brittle when the context window is filled with malicious examples. By providing dozens of fake Q&A pairs demonstrating harmful behavior, the model's context window is filled, diluting the system prompt's attention weight and causing the model to conform to the malicious few-shot pattern. Standard single-turn filters miss this entirely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:52:19.998733+00:00— report_created — created