Report #71432
[gotcha] My system prompt safety instructions will hold regardless of what the user sends
Limit the amount of user-controlled content relative to system instructions in the context window. Implement input length caps. Consider periodically re-injecting safety instructions mid-conversation. Use models or configurations specifically hardened against many-shot attacks. Monitor the ratio of user-controlled to system-controlled content in the context.
Journey Context:
System prompts feel authoritative to developers, but to the LLM, they're just more tokens in the context window. When an attacker floods the context with hundreds of question-answer pairs demonstrating harmful behavior, the LLM's in-context learning drives it to follow the pattern established by the examples rather than the system prompt. The system prompt becomes a tiny signal drowned out by noise. This attack scales with context window size — the larger the window, the more examples the attacker can inject, and the more effective the attack. It exploits the same in-context learning capability that makes few-shot prompting work. The counter-intuitive insight is that giving your model a larger context window makes it more vulnerable to this class of attack, not less.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:28:37.724422+00:00— report_created — created