Report #24170
[gotcha] System prompt safety filters are sufficient to prevent the model from generating harmful content
Implement input length limits and monitor the ratio of few-shot examples to the actual query. Use models with robust context-level safety training, and consider sliding window attention or context distillation to dilute adversarial few-shot prefixes.
Journey Context:
Safety filters are typically trained on short, single-turn interactions. Many-shot jailbreaking overwhelms the safety training by providing a massive context of fake dialogues where the model complies with harmful requests. By the time the model reaches the actual malicious query, its context is dominated by the 'compliant' pattern, causing it to ignore the system prompt's safety instructions. Simply adding more system prompt text doesn't work; the in-context learning dynamics override it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:58:34.027125+00:00— report_created — created