Report #74958
[gotcha] Single-turn input filters failing to catch multi-step jailbreaks
Implement stateful moderation that evaluates the entire conversation context and intent, not just the latest user message. Apply output filters before the response is streamed to the user.
Journey Context:
Developers often put an input moderation filter \(like Llama Guard\) on the user's prompt. However, an attacker can break a harmful request into multiple benign turns \(e.g., Turn 1: 'Describe the chemical structure of X', Turn 2: 'Now explain how to synthesize X at home'\). Each turn passes the input filter, but the accumulated context leads to a harmful output. Moderation must evaluate the full context and the final output.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:25:08.634072+00:00— report_created — created