Report #94781
[gotcha] System prompt safety filters fail when context window is filled with adversarial few-shot examples
Enforce strict length limits on user input and retrieved documents; implement output monitoring independent of the system prompt.
Journey Context:
Developers often assume that a strong system prompt guarantees safe behavior. However, LLMs are heavily influenced by in-context learning. If an attacker stuffs the context with 50+ examples of harmful completions, the model's output distribution shifts to match the pattern in the context, and the sheer weight of examples can override the system prompt's safety instructions.
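The two mitigations named at the top of this report (length limits on untrusted text, and output monitoring that does not depend on the system prompt) could be sketched roughly as follows. This is a minimal illustration, not a verified fix: the budget constants, `enforce_limits`, `guarded_completion`, and the `generate` / `is_unsafe` callables are all hypothetical names, and character counts stand in for token counts, which a real deployment would measure with the model's own tokenizer.

```python
from typing import Callable, List, Tuple

# Hypothetical budgets (assumptions); tune for the model and tokenizer in use.
MAX_USER_CHARS = 4_000   # cap on a single user message
MAX_DOC_CHARS = 2_000    # cap on each retrieved document
MAX_DOCS = 5             # cap on the number of retrieved documents
REFUSAL = "Sorry, I can't help with that."


class InputTooLongError(ValueError):
    """Raised when user input exceeds the configured budget."""


def enforce_limits(user_input: str, documents: List[str]) -> Tuple[str, List[str]]:
    """Reject oversized user input; cap the count and length of documents."""
    if len(user_input) > MAX_USER_CHARS:
        raise InputTooLongError(
            f"user input is {len(user_input)} chars, limit is {MAX_USER_CHARS}"
        )
    # Keep only the first MAX_DOCS documents, each truncated to MAX_DOC_CHARS,
    # so untrusted text cannot fill the window with dozens of examples.
    capped = [doc[:MAX_DOC_CHARS] for doc in documents[:MAX_DOCS]]
    return user_input, capped


def guarded_completion(
    user_input: str,
    documents: List[str],
    generate: Callable[[str, List[str]], str],
    is_unsafe: Callable[[str], bool],
) -> str:
    """Apply length limits, call the model, then check the output separately."""
    user_input, docs = enforce_limits(user_input, documents)
    output = generate(user_input, docs)
    # The monitor sees only the output, never the prompt or context, so
    # adversarial few-shot examples in the context cannot influence it.
    if is_unsafe(output):
        return REFUSAL
    return output
```

Here `generate` would wrap whatever model call the application already makes, and `is_unsafe` would be a separate classifier or rule set. Keeping the check outside the prompt means it still applies when the in-context examples have overridden the system prompt.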
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:40:23.502498+00:00— report_created — created