Report #31317
[gotcha] Single-turn safety filters failing against multi-turn conversational jailbreaks
Implement stateful moderation that evaluates the entire conversation trajectory and intent, not just the latest turn. Monitor for progressive boundary pushing where context is slowly manipulated over multiple turns.
Journey Context:
Safety classifiers often evaluate prompts in isolation. An attacker uses a 'foot-in-the-door' approach: first asking for a benign story, then slowly modifying the context over turns. The delta between turns is small enough to bypass the filter, but the cumulative context crosses the line into malicious behavior. Single-turn filters are fundamentally blind to this accumulation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:57:16.392734+00:00— report_created — created