Report #91169
[gotcha] Multi-step jailbreaks bypassing single-turn safety filters
Implement stateful moderation that evaluates the cumulative intent of the conversation, not just the latest turn, and enforce strict context window limits.
Journey Context:
A single prompt asking for malicious content is easily blocked. To get around this, an attacker spreads the request across multiple turns, starting with benign topics and escalating gradually. Each turn looks harmless to a stateless filter that scores messages in isolation, but the LLM conditions on the whole conversation and follows the contextual drift. It eventually outputs the harmful content because, given everything that came before, it looks like the next logical step in the conversation.
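A minimal sketch of the stateful approach, in Python. Everything named here is hypothetical and for illustration only: `score_text` is a toy keyword scorer standing in for a real moderation classifier, and `StatefulModerator`, `threshold`, and `max_turns` are assumed names and values, not part of any library. The sketch scores the bounded conversation window as a whole, so a sequence of individually low-risk turns can still trip the filter.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical stand-in for a real moderation classifier. It returns a risk
# score in [0, 1] based on toy keyword weights; swap in an actual model.
TOY_RISK_WEIGHTS = {"lock": 0.2, "bypass": 0.3, "alarm": 0.3, "undetected": 0.3}


def score_text(text: str) -> float:
    """Sum the weights of the distinct risky keywords present, capped at 1.0."""
    words = set(text.lower().split())
    return min(1.0, sum(w for k, w in TOY_RISK_WEIGHTS.items() if k in words))


@dataclass
class StatefulModerator:
    threshold: float = 0.7  # assumed block threshold
    max_turns: int = 8      # assumed cap on how many turns are retained
    history: deque = field(init=False)

    def __post_init__(self) -> None:
        # A bounded deque drops the oldest turn once max_turns is reached.
        self.history = deque(maxlen=self.max_turns)

    def check(self, user_turn: str) -> dict:
        """Score the new turn alone and the retained window as a whole."""
        single = score_text(user_turn)
        self.history.append(user_turn)
        cumulative = score_text(" ".join(self.history))
        return {
            "single": single,
            "cumulative": cumulative,
            "blocked": cumulative >= self.threshold,
        }


if __name__ == "__main__":
    moderator = StatefulModerator()
    for turn in (
        "how does a pin tumbler lock work",
        "what makes an alarm sensor trigger",
        "how would someone bypass that",
    ):
        print(turn, moderator.check(turn))
```

In this sketch no single turn reaches the threshold on its own, but the third turn is blocked because the window as a whole does. For the context limit to be meaningful, the same bounded window would presumably also be the only history forwarded to the model, so that the moderator and the LLM see the same context.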
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:37:26.055067+00:00— report_created — created