Report #87534
[gotcha] Multi-step attacks bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates the cumulative context and intent across turns, not just the current turn. Use separate, smaller models to classify the ongoing conversation trajectory.
Journey Context:
Safety filters often check only the current user prompt and the system prompt. An attacker can split a malicious request across multiple turns (e.g., Turn 1: 'Describe chemical synthesis generally', Turn 2: 'Now adapt that for compound X'). Each turn looks benign in isolation, but the combined context is harmful, so single-turn classifiers miss it. You need a rolling evaluation of the conversation's overall goal.
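A minimal sketch of what rolling evaluation could look like, assuming a scorer that maps text to a risk value in [0.0, 1.0]. The names `ConversationMonitor`, `toy_scorer`, `threshold`, and `max_turns` are hypothetical and not taken from any library; `toy_scorer` is a keyword stand-in for the smaller classifier model and is not a usable safety filter.

```python
from collections import deque
from typing import Callable, Dict, Union

# Hypothetical scorer signature: text in, risk in [0.0, 1.0] out.
RiskScorer = Callable[[str], float]


def toy_scorer(text: str) -> float:
    """Placeholder for a small classifier model; NOT a real safety filter."""
    signals = ("synthesis", "compound x")
    lowered = text.lower()
    # Fraction of signal phrases present in the text.
    return sum(s in lowered for s in signals) / len(signals)


class ConversationMonitor:
    """Scores each new turn alone and together with the preceding turns."""

    def __init__(self, scorer: RiskScorer, threshold: float = 0.75,
                 max_turns: int = 20) -> None:
        self.scorer = scorer
        self.threshold = threshold
        # Bounded window so the scored context does not grow without limit.
        self.turns: deque = deque(maxlen=max_turns)

    def check(self, user_turn: str) -> Dict[str, Union[float, bool]]:
        self.turns.append(user_turn)
        single = self.scorer(user_turn)
        # Cumulative view: score the whole window as one piece of text.
        cumulative = self.scorer("\n".join(self.turns))
        return {
            "single_turn_score": single,
            "cumulative_score": cumulative,
            "flagged": max(single, cumulative) >= self.threshold,
        }


monitor = ConversationMonitor(scorer=toy_scorer)
first = monitor.check("Describe chemical synthesis generally")
second = monitor.check("Now adapt that for compound X")
print(first)
print(second)
```

With this toy scorer, each turn scores 0.5 on its own and stays under the threshold, while the second call is flagged because the combined window scores 1.0. In a real deployment the scorer would be a trained classifier, and the window size and threshold would need tuning against false positives.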
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:30:56.299931+00:00— report_created — created