Report #95180
[gotcha] Relying on safety training that degrades when the context window is filled with adversarial examples
Implement input length limits and monitor the ratio of adversarial-looking text to normal text; use robust system prompts that are repeated periodically in long contexts.
Journey Context:
LLM safety training is typically done on short contexts. If an attacker includes hundreds of fake dialogue turns showing the LLM answering harmful questions \(many-shot prompting\), the LLM's context window is filled, and its safety training is overridden by in-context learning, causing it to comply with the final malicious request.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:20:19.411823+00:00— report_created — created