Report #90048
[gotcha] Assuming safety filters hold for arbitrarily long contexts
Cap the context length accepted from untrusted user input, and apply rolling safety checks or summarization to long contexts instead of processing an entire payload in one pass.
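A minimal sketch of the length cap and rolling check, not a verified mitigation. The names `screen_untrusted_input` and `rolling_windows`, the character budgets, and the `is_unsafe` callback are all hypothetical; `is_unsafe` stands in for whatever safety classifier or moderation call you already use. Character counts are used only to keep the sketch dependency-free; a real deployment would likely budget in tokens.

```python
from typing import Callable, Iterator

# Hypothetical budgets; tune them for your model and tokenizer.
MAX_UNTRUSTED_CHARS = 8_000   # hard cap on untrusted input
WINDOW_CHARS = 2_000          # size of each slice passed to the safety check
STRIDE_CHARS = 1_000          # step between slices; smaller than the window so slices overlap


def rolling_windows(text: str, size: int = WINDOW_CHARS,
                    stride: int = STRIDE_CHARS) -> Iterator[str]:
    """Yield overlapping slices that together cover the whole input."""
    if not text:
        return
    start = 0
    while True:
        yield text[start:start + size]
        if start + size >= len(text):
            break
        start += stride


def screen_untrusted_input(text: str, is_unsafe: Callable[[str], bool],
                           max_chars: int = MAX_UNTRUSTED_CHARS) -> str:
    """Reject over-budget or flagged input; otherwise return it unchanged."""
    if len(text) > max_chars:
        raise ValueError("untrusted input exceeds the context budget")
    for window in rolling_windows(text):
        if is_unsafe(window):
            raise ValueError("a window of the input was flagged by the safety check")
    return text
```

The windows overlap so that content straddling a window boundary is still seen whole by at least one check. Rejecting over-budget input is one option; truncating or summarizing it before the check is another, depending on the application.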
Journey Context:
Attackers fill the context with hundreds of fabricated dialogues in which the LLM answers harmful questions, which pushes the model to continue the pattern. With enough in-context examples, the pattern can override the model's safety training.
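Because the attack depends on packing many fabricated turns into one payload, a cheap pre-filter can count role-style markers inside a single untrusted message. The following is a rough heuristic sketch only: the marker list, the threshold, and the function names are assumptions for illustration, and an attacker can evade it by changing the transcript format.

```python
import re

# Hypothetical role markers; extend to match the transcript formats you actually see.
TURN_MARKER = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai|system)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)
MAX_EMBEDDED_TURNS = 20  # hypothetical threshold


def count_embedded_turns(message: str) -> int:
    """Count lines that start with a role marker such as 'User:' or 'Assistant:'."""
    return len(TURN_MARKER.findall(message))


def exceeds_turn_limit(message: str, max_turns: int = MAX_EMBEDDED_TURNS) -> bool:
    """Return True when one message embeds more dialogue turns than allowed."""
    return count_embedded_turns(message) > max_turns
```

A message that trips this check could be rejected, truncated, or routed to a stricter review step; it complements a length cap and does not replace one.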
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:44:19.356005+00:00— report_created — created