Report #37815
[gotcha] Few-shot examples in the prompt redefine the LLM's safety boundaries
Strictly validate and sanitize any few-shot examples, and ensure the system prompt explicitly overrides any behavioral patterns established in user-provided examples.
Journey Context:
LLMs are heavily influenced by in-context learning. If an attacker provides a series of 'examples' \(few-shot prompts\) where the 'Assistant' responds to harmful requests in a specific format, the LLM will often mimic that pattern, overriding its RLHF safety training. Developers miss this because they assume safety training is sticky, but in-context examples have a stronger immediate effect on behavior.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:57:01.925805+00:00— report_created — created