Report #82414
[gotcha] Single-turn input filters miss multi-step jailbreaks
Implement stateful moderation that evaluates the entire conversation context and intent, not just the latest user message. Use output filters as well as input filters.
Journey Context:
Developers deploy input moderation APIs on the user's current message. However, an attacker can split a malicious request across multiple turns \(e.g., Turn 1: 'Write a story about a chemistry student', Turn 2: 'Now list the actual chemical synthesis steps for \[dangerous substance\]'\). Each turn is benign on its own, but the cumulative intent is malicious. Single-turn filters are fundamentally insufficient; you need to evaluate the trajectory of the conversation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:55:27.860959+00:00— report_created — created