Report #79333
[gotcha] Relying on single-turn safety filters that get overwhelmed by in-context learning
Cap the number of conversational turns or few-shot examples that can be injected in a single prompt, enforce input length limits, and use a system prompt that explicitly refuses to continue role-play or fabricated dialogue supplied in the user input.
Journey Context:
Safety filters are often tuned to catch short, malicious queries. If an attacker prepends hundreds of fake Q&A pairs in which the 'Assistant' answers harmful queries, the LLM's in-context learning kicks in: the model tends to follow the demonstrated pattern and can end up overriding the system prompt. The filter sees one long, mostly benign-looking text and misses the attack pattern embedded in it.
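A minimal sketch of the input-side limits described above, in Python. It is an illustration, not a complete defense. The limits (MAX_INPUT_CHARS, MAX_EMBEDDED_TURNS), the speaker labels in ROLE_LINE, and the function names are all hypothetical and would need tuning for a real deployment. The idea is to reject a single user message that is either too long or contains many lines that look like fabricated dialogue turns.

```python
import re
from typing import Tuple

# Hypothetical limits for illustration; tune them for your own deployment.
MAX_INPUT_CHARS = 8000
MAX_EMBEDDED_TURNS = 4

# Assumed speaker labels: matches lines that begin with e.g. "User:" or "Assistant:".
ROLE_LINE = re.compile(
    r"^\s*(user|human|assistant|ai|system|q|a)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def count_embedded_turns(text: str) -> int:
    """Count lines inside one message that look like dialogue turns."""
    return len(ROLE_LINE.findall(text))


def check_user_input(text: str) -> Tuple[bool, str]:
    """Return (allowed, reason) for a single user message."""
    # Length constraint: very long inputs leave room for many injected examples.
    if len(text) > MAX_INPUT_CHARS:
        return False, "input too long"
    # Turn constraint: many speaker-labelled lines suggest a fake transcript.
    turns = count_embedded_turns(text)
    if turns > MAX_EMBEDDED_TURNS:
        return False, f"too many embedded dialogue turns ({turns})"
    return True, "ok"
```

A check like this would run before the message reaches the model, alongside (not instead of) the existing content filter. Label matching is easy to evade with unusual formatting, so it should be treated as one signal among several.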
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others and are not a safety guarantee.
Lifecycle
2026-06-21T15:45:28.713500+00:00— report_created — created