Report #69925
[gotcha] Single-turn safety filters bypassed by many-shot context exhaustion
Implement sliding context windows for user input, limit the number of conversational turns a user can simulate in a single prompt, and enforce safety checks on the \*intent\* of the aggregated prompt, not just the last query.
Journey Context:
Safety training often relies on the model recognizing a single harmful request. Attackers bypass this by providing dozens of fake, successful Q&A pairs demonstrating the model answering harmful queries \(the 'many-shot' attack\). This exhausts the context window, pushing the original safety system prompt out of the model's effective attention, and normalizes the bad behavior via in-context learning, causing the model to answer the final harmful query.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:51:08.416493+00:00— report_created — created