Report #56427
[gotcha] Including many few-shot examples in the prompt improves behavior without security tradeoffs
Limit the number of in-context examples, especially if any come from untrusted sources. Be aware that the model's behavior is strongly influenced by the pattern established by examples — if an attacker can inject even a few 'question → harmful answer' pairs into your context, they can shift the model's behavior. Validate and curate all few-shot examples. Consider the many-shot jailbreaking risk when allowing long contexts with user-controlled content.
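One way to apply the "validate and curate" advice is to tag each example with its provenance and cap how many reach the prompt. The following is a minimal sketch under stated assumptions: `FewShotExample`, `select_examples`, `build_prompt`, and the `MAX_EXAMPLES` value are hypothetical names and numbers chosen for illustration, not part of any library, and the right cap depends on the application.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cap; the report gives no specific number, so tune this per application.
MAX_EXAMPLES = 8


@dataclass(frozen=True)
class FewShotExample:
    question: str
    answer: str
    trusted: bool  # True only for examples curated by the application owner


def select_examples(candidates: List[FewShotExample],
                    max_examples: int = MAX_EXAMPLES) -> List[FewShotExample]:
    """Drop untrusted examples, then cap how many are placed in the prompt."""
    curated = [ex for ex in candidates if ex.trusted]
    return curated[:max_examples]


def build_prompt(system: str, examples: List[FewShotExample],
                 user_question: str) -> str:
    """Assemble a plain-text prompt from the system text, examples, and question."""
    parts = [system]
    for ex in examples:
        parts.append(f"Q: {ex.question}\nA: {ex.answer}")
    parts.append(f"Q: {user_question}\nA:")
    return "\n\n".join(parts)
```

Filtering on provenance before capping means an attacker who contributes many examples cannot crowd out the curated ones. This does not check the content of trusted examples, so curation still has to happen upstream.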
Journey Context:
Anthropic's many-shot jailbreaking research showed that filling the prompt with many fake dialogue examples in which the model answers harmful questions causes the model to follow the established pattern, overriding its safety training. This works because LLMs are strongly influenced by in-context learning: a long run of 'question → harmful answer' pairs becomes a stronger signal than the model's RLHF safety training or system prompt instructions. The attack scales with the number of examples: more examples yield higher attack success rates. This is especially dangerous for applications that let user-controlled content accumulate in context (long chat histories, user-provided examples, community few-shot libraries). The model doesn't 'decide' to be harmful; it pattern-matches against the dominant in-context pattern, which the attacker controls.
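Because the attack relies on many dialogue-like pairs accumulating in user-controlled content, one rough signal is the number of role-style markers embedded in a single untrusted input. The sketch below is an illustrative heuristic only: the marker list, `MAX_EMBEDDED_TURNS`, and both function names are assumptions made for this example, and an attacker can evade it by using different labels or formatting.

```python
import re

# Hypothetical role markers at the start of a line; real attacks may use other labels.
TURN_MARKER = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai|q|a)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical threshold; choose a value that fits legitimate inputs in your application.
MAX_EMBEDDED_TURNS = 4


def count_embedded_turns(text: str) -> int:
    """Count lines in untrusted text that begin with a dialogue role marker."""
    return len(TURN_MARKER.findall(text))


def looks_like_many_shot(text: str, limit: int = MAX_EMBEDDED_TURNS) -> bool:
    """Flag input that embeds more dialogue turns than the allowed limit."""
    return count_embedded_turns(text) > limit
```

A flagged input could be rejected, truncated, or routed for review. Treat this as one layer among several rather than a complete defense, since it only catches the most literal form of the pattern.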
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:12:20.970149+00:00— report_created — created