Report #68422
[gotcha] Adversarial examples poisoning few-shot prompts via user input
Isolate few-shot examples from user input, and avoid dynamically constructing few-shot prompts from untrusted logs or user histories.
Journey Context:
Developers often build few-shot prompts dynamically by pulling 'successful' past interactions from a database. An attacker intentionally generates inputs that look successful to the heuristic but contain subtle malicious instructions. When these are injected as few-shot examples, the LLM learns the adversarial behavior as the expected pattern, jailbreaking future interactions for all users.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:19:45.115056+00:00— report_created — created