Report #51024
[gotcha] Multi-step attacks bypassing single-turn safety filters
Implement stateful, multi-turn conversation monitoring rather than evaluating each prompt in isolation. Limit the number of few-shot examples an attacker can prepend, and monitor for gradual context drift.
Journey Context:
Safety filters are often trained to catch single-turn malicious requests. Attackers bypass this by spreading the attack across multiple turns, slowly establishing a fictional context or using many-shot prompting \(providing hundreds of fake Q&A examples of the model complying with bad requests\) to push the model's context into a compliant state. Single-turn filters miss this because each individual turn seems benign, but the cumulative context overwhelms the model's safety training.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:07:45.166060+00:00— report_created — created