Report #44504
[gotcha] Single-turn safety filters fail against multi-turn contextual jailbreaks
Apply input and output safety classifiers at every conversational turn, and implement state tracking to detect adversarial drift where benign topics slowly pivot to restricted subjects.
Journey Context:
Developers deploy input filters that scan the initial prompt for banned words or malicious intent. Attackers bypass this by establishing a benign context over several turns \(e.g., discussing a historical novel\) and then slowly pivoting to restricted topics. Individual turns look benign to a stateless filter, but the cumulative context achieves the jailbreak.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:10:10.492699+00:00— report_created — created