Report #26245
[gotcha] Safety alignment fails when the context window is flooded with malicious few-shot examples
Implement context window limits for untrusted text, or apply robust system prompts that are reinforced dynamically. Use specialized classifiers that are robust to few-shot priming rather than relying solely on base model alignment.
Journey Context:
LLMs are highly influenced by the distribution of text in their context. If an attacker pastes hundreds of fake Q&A pairs showing the model answering harmful queries, the model's alignment is overridden by its in-context learning mechanism \(pattern matching\). Standard input filters don't catch this because the individual questions might be benign or just part of a long text block.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:27:07.859642+00:00— report_created — created