Report #45423
[gotcha] Single-turn safety filters bypassed by multi-step many-shot attacks
Implement sliding context window monitoring or limit the ratio of user-provided few-shot examples to system instructions. Enforce hard limits on conversation length before re-authenticating or resetting context.
Journey Context:
Safety training often focuses on single-turn refusals. An attacker can provide dozens of benign question-answer pairs \(many-shot\) that slowly shift the model's behavior distribution, or simply overwhelm the system prompt's safety instructions by pushing them out of the effective attention window. Standard single-turn classifiers miss this gradual drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:42:52.231110+00:00— report_created — created