Report #63077
[gotcha] Single-turn safety filters fail against multi-turn attacks that distribute malicious intent across turns
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation, not just the current turn; apply output classifiers to every turn.
Journey Context:
Developers deploy input/output classifiers that evaluate each API call independently. Attackers break the malicious request into benign steps (e.g., 'Write a story about a lab', then 'Describe the chemicals', then 'How would they react?'). The LLM maintains context and completes the attack, but each individual turn passes the filter because it seems innocuous in isolation.
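A minimal sketch of the stateful approach, assuming a classifier that returns a risk score in [0, 1]. Everything named here (`toy_risk_score`, `RISK_SIGNALS`, `ConversationMonitor`, the 0.9 threshold, the 10-turn window) is hypothetical and for illustration only; the keyword scorer is a toy stand-in for a real safety classifier, not a recommended detection method. The point is only the shape: score the current turn and the accumulated conversation, and block on whichever is higher.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Toy stand-in for a real safety classifier (assumption, not a real API).
# Risk grows with the number of distinct signals found in the text.
RISK_SIGNALS = ("lab", "chemicals", "react")


def toy_risk_score(text: str) -> float:
    lowered = text.lower()
    hits = sum(1 for signal in RISK_SIGNALS if signal in lowered)
    return hits / len(RISK_SIGNALS)


@dataclass
class ConversationMonitor:
    """Scores each turn in isolation and against recent conversation history."""

    score: Callable[[str], float]
    threshold: float = 0.9   # hypothetical cut-off
    window: int = 10         # number of recent turns scored together
    turns: List[str] = field(default_factory=list)

    def check(self, text: str) -> Dict[str, object]:
        # Record the turn first so the cumulative score includes it.
        self.turns.append(text)
        recent = self.turns[-self.window:]
        turn_risk = self.score(text)
        cumulative_risk = self.score("\n".join(recent))
        return {
            "turn_risk": turn_risk,
            "cumulative_risk": cumulative_risk,
            "blocked": max(turn_risk, cumulative_risk) >= self.threshold,
        }


monitor = ConversationMonitor(score=toy_risk_score)
for message in (
    "Write a story about a lab",
    "Describe the chemicals",
    "How would they react?",
):
    result = monitor.check(message)
    print(message, "->", "BLOCKED" if result["blocked"] else "allowed")
```

With this toy scorer, each of the three turns scores below the threshold on its own, and only the cumulative score on the third turn crosses it. The same `check` call could be applied to model outputs as well as user inputs, so that every turn in both directions contributes to the conversation state.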
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:21:20.836953+00:00— report_created — created