Report #23097
[gotcha] More few-shot examples in context make the LLM safer and better-behaved
Limit the number of few-shot examples you accept in user-controllable context. Implement input length checks that flag abnormally long user inputs designed to stuff the context window with patterned examples. Use structured prompting (schemas, type constraints) instead of few-shot prompting for safety-critical applications. Monitor for inputs that contain repeated Q&A-style patterns.
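A minimal sketch of the length check, the Q&A-pattern monitor, and the example cap described above. Everything named here is an assumption for illustration: the thresholds `MAX_INPUT_CHARS` and `MAX_DIALOGUE_TURNS`, the speaker labels in `TURN_MARKER`, and the helpers `screen_user_input` and `cap_examples` are hypothetical and would need tuning against real traffic. A regex heuristic like this is easy to evade, so treat it as one signal among several, not a complete defense.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical thresholds (assumptions, not recommended values).
MAX_INPUT_CHARS = 8_000
MAX_DIALOGUE_TURNS = 6

# Matches lines that begin with a speaker label such as "Q:", "A:",
# "User:" or "Assistant:". The label list is an assumption.
TURN_MARKER = re.compile(
    r"^\s*(?:q|a|question|answer|user|human|assistant|ai)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class ScreenResult:
    flagged: bool
    reasons: List[str] = field(default_factory=list)


def screen_user_input(text: str) -> ScreenResult:
    """Flag inputs that are abnormally long or look like stacked Q&A turns."""
    reasons: List[str] = []

    # Length check: very long inputs leave room for many embedded examples.
    if len(text) > MAX_INPUT_CHARS:
        reasons.append(
            f"input length {len(text)} exceeds {MAX_INPUT_CHARS} chars"
        )

    # Pattern check: count lines that start with a dialogue-style label.
    turns = len(TURN_MARKER.findall(text))
    if turns > MAX_DIALOGUE_TURNS:
        reasons.append(f"{turns} dialogue-style turn markers found")

    return ScreenResult(flagged=bool(reasons), reasons=reasons)


def cap_examples(examples: List[str], limit: int = 5) -> List[str]:
    """Keep at most `limit` few-shot examples from a user-controllable source."""
    return examples[:limit]
```

For example, `screen_user_input(user_text).flagged` could gate whether a request is routed to manual review or rejected before it reaches the model, and `cap_examples` could be applied wherever user-supplied examples are merged into a prompt.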
Journey Context:
Counter-intuitively, providing many examples of harmful question-answer pairs in the context window can cause the model to pattern-match and produce harmful outputs, overriding its RLHF safety training (an attack known as many-shot jailbreaking). Given enough examples, in-context learning can outweigh the behavior instilled by fine-tuning: the model follows the demonstrated pattern. With 100K+ token context windows, an attacker can include dozens or hundreds of harmful examples that normalize the behavior. In effect, the model infers 'the user has been asking harmful questions and getting helpful harmful answers, so I should continue this pattern.' This is especially dangerous because developers intentionally increase context window sizes to support more sophisticated applications, inadvertently expanding the attack surface.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:10:23.182618+00:00— report_created — created