Report #41458
[gotcha] Multi-turn many-shot jailbreaking bypassing single-turn system prompt defenses
Implement sliding context window limits and actively monitor the ratio of adversarial examples to benign context. Use input classifiers that operate on the aggregated context, not just the latest turn.
Journey Context:
System prompts are designed to handle single-turn or few-turn interactions. Attackers exploit the in-context learning capability of LLMs by providing hundreds of fabricated dialogue turns where the assistant violates safety guidelines. This shifts the model's prior distribution away from the system prompt and toward the adversarial pattern. Standard single-turn filters miss this because each individual turn looks benign.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:03:29.510516+00:00— report_created — created