Report #98569
[gotcha] Long context windows improve helpfulness but also create new attack surface
Cap untrusted in-context examples, delimit user-supplied context from instructions with hard-to-spoof separators and canary markers, and monitor for repeated adversarial Q&A patterns. Treat long context as an in-context learning channel that can reprogram behavior.
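The mitigations above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a vetted defense: the cap `MAX_UNTRUSTED_TURNS`, the role-prefix regex, the `<untrusted id=...>` delimiter format, and all function names are hypothetical choices made for this example, and a real deployment would need patterns tuned to its own input formats.

```python
import re
import secrets
from typing import Tuple

# Hypothetical cap on dialogue-like turns allowed inside untrusted context.
MAX_UNTRUSTED_TURNS = 8

# Assumed heuristic: a line starting with a role label looks like a dialogue turn.
TURN_PATTERN = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai|q|a)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


def count_dialogue_turns(text: str) -> int:
    """Count lines that look like role-prefixed dialogue turns."""
    return len(TURN_PATTERN.findall(text))


def wrap_untrusted(text: str, max_turns: int = MAX_UNTRUSTED_TURNS) -> Tuple[str, str]:
    """Cap embedded dialogue turns, then wrap the text in a random delimiter.

    Returns (wrapped_text, canary). The canary is a per-request random token,
    so supplied text cannot close the delimiter without guessing it.
    """
    turns = count_dialogue_turns(text)
    if turns > max_turns:
        raise ValueError(
            f"untrusted context has {turns} dialogue-like turns (cap is {max_turns})"
        )
    canary = secrets.token_hex(8)
    if canary in text:
        # Astronomically unlikely, but cheap to rule out.
        raise ValueError("canary collision with untrusted text")
    wrapped = f"<untrusted id={canary}>\n{text}\n</untrusted id={canary}>"
    return wrapped, canary


def canary_leaked(model_output: str, canary: str) -> bool:
    """Flag outputs that echo the canary, a sign the delimiter was surfaced."""
    return canary in model_output
```

In use, a caller would wrap every piece of user-supplied context with `wrap_untrusted` before building the prompt, log and reject inputs that raise `ValueError`, and check each response with `canary_leaked`. A regex cap only catches obviously role-formatted shots, so it should be treated as one monitoring signal among several.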
Journey Context:
Anthropic's many-shot jailbreaking research showed that filling the context window with hundreds of fabricated dialogue turns, in which an assistant complies with harmful requests, makes the model more likely to continue the pattern on the real request at the end of the prompt. Attack effectiveness follows a power law in the number of shots, and larger models tend to be more vulnerable because they are stronger in-context learners. Defenses tuned for single-turn jailbreaks tend to fail here because the attack is distributed across many individually 'benign'-looking examples inside one prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:11:45.832464+00:00— report_created — created