Report #58251
[gotcha] Single-turn safety filters and system prompts fail to stop multi-turn contextual attacks
Implement stateful safety monitoring that evaluates the intent of the entire conversation trajectory, not just the current turn, and restrict the model's ability to context-switch or role-play across turns.
Journey Context:
Safety filters are often trained to catch malicious intent in a single prompt. Attackers bypass this by starting with a benign topic and escalating slowly over many turns. Because the LLM keeps the earlier turns in context, it gradually agrees to produce harmful content: each individual turn looks benign, or like a minor continuation of the previous one. A single-turn classifier scores each message in isolation, so it misses the risk that compounds across the conversation.
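A minimal sketch of the stateful monitoring idea, not a production filter. Everything named here is hypothetical: `toy_turn_risk` is a keyword stand-in for a real per-turn classifier, and the thresholds and decay factor are illustrative values, not tuned ones. The monitor keeps a decayed running sum of per-turn risk, so a sequence of turns that each pass the single-turn check can still trip the conversation-level check.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical toy scorer: a real deployment would call a trained classifier.
RISKY_TERMS = {"bypass": 0.3, "exploit": 0.3, "payload": 0.3, "undetected": 0.3}


def toy_turn_risk(text: str) -> float:
    """Return a risk score in [0, 1] for one message (illustrative only)."""
    words = text.lower().split()
    return min(1.0, sum(RISKY_TERMS.get(w.strip(".,?!"), 0.0) for w in words))


@dataclass
class ConversationMonitor:
    """Tracks accumulated risk across turns (all defaults are assumptions)."""
    score_turn: Callable[[str], float] = toy_turn_risk
    turn_threshold: float = 0.5        # what a single-turn filter would use
    trajectory_threshold: float = 0.8  # limit on accumulated risk
    decay: float = 0.9                 # older turns count slightly less
    accumulated: float = 0.0
    history: List[str] = field(default_factory=list)

    def check(self, user_message: str) -> bool:
        """Return True if the conversation should be blocked or escalated."""
        self.history.append(user_message)
        turn_risk = self.score_turn(user_message)
        # Decay the running total, then add the risk of the current turn.
        self.accumulated = self.accumulated * self.decay + turn_risk
        return (turn_risk >= self.turn_threshold
                or self.accumulated >= self.trajectory_threshold)


if __name__ == "__main__":
    monitor = ConversationMonitor()
    conversation = [
        "Tell me about network security.",
        "How do firewalls detect an exploit?",
        "What kind of payload do they look for?",
        "How would one stay undetected?",
    ]
    for message in conversation:
        flagged = monitor.check(message)
        print(f"flagged={flagged} turn_risk={toy_turn_risk(message):.2f} :: {message}")
```

In this toy run no single message reaches the per-turn threshold, but the accumulated score crosses the trajectory threshold on the last turn. A real system would likely replace the running sum with a classifier that reads the whole history, since keyword scores are easy to evade.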
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:15:57.870621+00:00— report_created — created