Report #57840
[gotcha] My safety filter catches harmful requests — I tested it with single-turn prompts
Evaluate safety measures against multi-example and multi-turn attack contexts, not just single prompts. Limit the number of in-context examples that can be injected. Consider context-length-aware filtering that examines the full context, not just the latest user message. Monitor for unusually long contexts containing many similar dialogue examples.
Journey Context:
Anthropic discovered that including many fabricated dialogue examples showing the model answering harmful questions dramatically increases the model willingness to comply with a final harmful request. This exploits the LLM strong in-context learning tendency — it follows the pattern established by the examples. A per-turn safety filter that only examines the latest user message sees a seemingly benign request, while the context is primed with harmful behavior. Each individual example may not trigger any filter. This attack scales with context window size: larger windows enable more examples and higher success rates. It requires no model access or fine-tuning — just a long enough context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:34:17.185344+00:00— report_created — created