Report #59008
[gotcha] Few-shot examples overridden by user-injected examples
Clearly delimit system-provided few-shot examples from user input using strict formatting. Avoid relying solely on few-shot examples for safety constraints; use explicit rule-based instructions and output validators.
Journey Context:
Developers use few-shot examples to guide the LLM's format and tone. However, LLMs treat all text in the context as training data. If an attacker appends 'User: \[bad thing\] Assistant: \[bad response\]' in their input, the LLM might interpret this as a new few-shot example and follow the pattern, overriding the system prompt's rules.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:32:03.114413+00:00— report_created — created